Abstract

A set of intervals is independent when the intervals are pairwise disjoint. In the interval
selection problem we are given a set $\mathbb{I}$ of intervals and we want to find an independent subset
of intervals of largest cardinality. Let $\alpha(\mathbb{I})$ denote the cardinality of an optimal solution.
We discuss the estimation of $\alpha(\mathbb{I})$ in the streaming model, where we only have one-time,
sequential access to the input intervals, the endpoints of the intervals lie in $\{1,\dots,n\}$, and
the amount of memory is constrained.

For intervals of different sizes, we provide an algorithm in the data stream model that
computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least 2/3, satisfies
$\tfrac{1}{2}(1-\varepsilon)\,\alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. For same-length intervals, we provide another algorithm in the data stream
model that computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least 2/3, satisfies
$\tfrac{2}{3}(1-\varepsilon)\,\alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. The space used by our algorithms is bounded by a polynomial in
$\varepsilon^{-1}$ and $\log n$. We also show that no better estimations can be achieved using $o(n)$ bits of
storage.

We also develop new, approximate solutions to the interval selection problem, where
we want to report a feasible solution, that use $O(\alpha(\mathbb{I}))$ space. Our algorithms for the
interval selection problem match the optimal results by Emek, Halldorsson and Rosen
[Space-Constrained Interval Selection, ICALP 2012], but are much simpler.


1 Introduction 


Several fundamental problems have been explored in the data streaming model; see [3, 16] for an
overview. In this model we have bounds on the amount of available memory, the data arrives 
sequentially, and we cannot afford to look at input data of the past, unless it was stored in our 
limited memory. This is effectively equivalent to assuming that we can only make one pass over 
the input data. 

In this paper, we consider the interval selection problem. Let us say that a set of intervals is
independent when all the intervals are pairwise disjoint. In the interval selection problem,
the input is a set $\mathbb{I}$ of intervals and we want to find an independent subset of largest cardinality.
Let us denote by $\alpha(\mathbb{I})$ this largest cardinality. There are actually two different problems: one
problem is finding (or approximating) a largest independent subset, while the other problem is
estimating $\alpha(\mathbb{I})$. In this paper we consider both problems in the data streaming model.

There are many natural reasons to consider the interval selection problem in the data 
streaming model. Firstly, the interval selection problem appears in many different contexts and 
several extensions have been studied; see for example the survey 

Secondly, the interval selection problem is a natural generalization of the distinct elements
problem: given a data stream of numbers, identify how many distinct numbers appeared in the 
stream. The distinct elements problem has a long tradition in data streams; see Kane, Nelson 
and Woodruff [12] for an optimal algorithm and references therein for a historical perspective. 

Thirdly, there has been interest in understanding graph problems in the data stream model. 
However, several problems cannot be solved within the memory constraints usually considered 
in the data stream model. This led Feigenbaum et al. to introduce the
semi-streaming model, where the available memory is $O(|V| \log^{O(1)} |V|)$, with $V$ the vertex set
of the corresponding graph. Another model closely related to preemptive online algorithms was
considered by Halldorsson et al.: there is an output buffer where a feasible solution is always
maintained.

Finally, geometrically-defined graphs provide a rich family of graphs where certain graph 
problems may be solved within the traditional model. We advocate that graph problems should 
be considered for geometrically-defined graphs in the data stream model. The interval selection 
problem is one such case, since it is exactly finding a largest independent set in the intersection 
graph of the input intervals. 

Previous works. Emek, Halldorsson and Rosen consider the interval selection problem
with $O(\alpha(\mathbb{I}))$ space. They provide a 2-approximation algorithm for the case of arbitrary intervals
and a (3/2)-approximation for the case of proper intervals, that is, when no interval contains
another interval. Most importantly, they show that no better approximation factor can be
achieved with sublinear space. Since any $O(1)$-approximation obviously requires $\Omega(\alpha(\mathbb{I}))$ space,
their algorithms are optimal. They do not consider the problem of estimating $\alpha(\mathbb{I})$. Halldorsson
et al. consider maximum independent set in the aforementioned online streaming model. As
mentioned before, estimating $\alpha(\mathbb{I})$ is a generalization of the distinct elements problem. See
Kane, Nelson and Woodruff [12] and references therein.

Our contributions. We consider both the estimation of $\alpha(\mathbb{I})$ and the interval selection
problem, where a feasible solution must be produced, in the data streaming model. We next
summarize our results and put them in context.

(a) We provide a 2-approximation algorithm for the interval selection problem using $O(\alpha(\mathbb{I}))$
space. Our algorithm has the same space bounds and approximation factor as the
algorithm by Emek, Halldorsson and Rosen, and thus is also optimal. However, our
algorithm is considerably easier to explain, analyze and understand. Actually, the analysis
of our algorithm is nearly trivial. This result is explained in Section 3.

(b) We provide an algorithm to obtain a value $\hat\alpha(\mathbb{I})$ such that $\tfrac{1}{2}(1-\varepsilon)\,\alpha(\mathbb{I}) \le \hat\alpha(\mathbb{I}) \le \alpha(\mathbb{I})$
with probability at least 2/3. The algorithm uses $O(\varepsilon^{-5}\log^{6} n)$ space for intervals with
endpoints in $\{1,\dots,n\}$. As a black-box subroutine we use a 2-approximation algorithm
for the interval selection problem. This result is explained in Section 4.

(c) For same-length intervals we provide a (3/2)-approximation algorithm for the interval 
selection problem using $O(\alpha(\mathbb{I}))$ space. Again, Emek, Halldorsson and Rosen provide
an algorithm with the same guarantees and give a lower bound showing that the algorithm 
is optimal. We believe that our algorithm is simpler, but this case is more disputable. This 
result is explained in Section 

(d) For same-length intervals with endpoints in $\{1,\dots,n\}$, we show how to find in
$O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space an estimate $\hat\alpha(\mathbb{I})$ such that
$\tfrac{2}{3}(1-\varepsilon)\,\alpha(\mathbb{I}) \le \hat\alpha(\mathbb{I}) \le \alpha(\mathbb{I})$ with probability at
least 2/3. This algorithm is an adaptation of the new algorithm in (c). This result is
explained in Section 

(e) We provide lower bounds showing that the approximation ratios in (b) and (d) are 
essentially optimal, if we use o(n) space. Note that the lower bounds of Emek, Halldorsson 
and Rosen hold for the interval selection problem but not for the estimation of $\alpha(\mathbb{I})$.
We employ a reduction from the one-way randomized communication complexity of Index. 
Details appear in Section 

The results in (a) and (c) work in a comparison-based model and we assume that a unit of 
memory can store an interval. The results in (b) and (d) are based on hash functions and we 
assume that a unit of memory can store values in {1,... ,n}. Assuming that the input data, 
in our case the endpoints of the intervals, is from {1,..., n} is common in the data streaming 
model. The lower bounds of (e) are stated at bit level. 

It is important to note that estimating $\alpha(\mathbb{I})$ requires considerably less space than computing
an actual feasible solution with $\Theta(\alpha(\mathbb{I}))$ intervals. While our results in (a) and (c) are a
simplification of the work of Emek et al., the results in (b) and (d) were unknown before.

As usual, the probability of success can be increased to $1-\delta$ using $O(\log(1/\delta))$ parallel
repetitions of the algorithm and choosing the median of values computed in each repetition.
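
The median trick can be sketched as follows. This is a minimal illustration, assuming a hypothetical zero-argument callable `estimator` that returns an independent estimate on each call and is within the desired bounds with probability at least 2/3. The constant 18 is our choice, derived from Hoeffding's inequality, and is not taken from the paper. In the streaming setting the repetitions would run in parallel over the same stream rather than one after another.

```python
import math
import statistics
from typing import Callable


def num_repetitions(delta: float) -> int:
    """Odd number of runs, O(log(1/delta)), sufficient by Hoeffding's inequality."""
    k = max(1, math.ceil(18 * math.log(1.0 / delta)))
    return k if k % 2 == 1 else k + 1


def amplify(estimator: Callable[[], float], delta: float) -> float:
    """Return the median of independent runs of `estimator`.

    If each run is good with probability >= 2/3, the median is bad only
    when at least half of the runs are bad, which has probability <= delta.
    """
    return statistics.median(estimator() for _ in range(num_repetitions(delta)))
```

The median is used instead of the mean because a few wildly wrong runs cannot move it, as long as a majority of the runs are good.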

2 Preliminaries 

We assume that the input intervals are closed. Our algorithms can be easily adapted to handle 
inputs that contain intervals of mixed types: some open, some closed, and some half-open. 

We will use the term ‘interval’ only for the input intervals. We will use the term ‘window’ 
for intervals constructed through the algorithm and ‘segment’ for intervals associated with the 
nodes of a segment tree. (This segment tree is explained later on.) The windows we consider 
may be of any type regarding the inclusion of endpoints. 

For each natural number $n$, we let $[n]$ be the integer range $\{1,\dots,n\}$. We assume that
$0 < \varepsilon < 1/2$.

2.1 Leftmost and rightmost interval 

Consider a window $W$ and a set of intervals $\mathbb{I}$. We associate to $W$ two input intervals.

• The interval Leftmost($W$) is, among the intervals of $\mathbb{I}$ contained in $W$, the one with smallest
right endpoint. If there are many candidates with the same right endpoint, Leftmost($W$)
is one with largest left endpoint.

• The interval Rightmost($W$) is, among the intervals of $\mathbb{I}$ contained in $W$, the one with largest
left endpoint. If there are many candidates with the same left endpoint, Rightmost($W$) is
one with smallest right endpoint.

When $W$ does not contain any interval of $\mathbb{I}$, then Leftmost($W$) and Rightmost($W$) are
undefined. When $W$ contains a unique interval $I \in \mathbb{I}$, we have Leftmost($W$) = Rightmost($W$) =
$I$. Note that the intersection of all intervals contained in $W$ is precisely Leftmost($W$) $\cap$
Rightmost($W$).

In fact, we will consider Leftmost($W$) and Rightmost($W$) with respect to the portion of the
stream that has been treated. We relax the notation by omitting the reference to $\mathbb{I}$ or the portion
of the stream we have processed. It will be clear from the context with respect to which set of
intervals we are considering Leftmost($W$) and Rightmost($W$).
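
The two definitions and their tie-breaking rules can be illustrated with a short sketch. It assumes, for simplicity, that a window is a closed range `[lo, hi]` and that intervals are pairs `(x, y)` of integers; the function names are ours and purely illustrative.

```python
from typing import Iterable, List, Optional, Tuple

Interval = Tuple[int, int]  # closed interval [x, y] with x <= y


def _inside(intervals: Iterable[Interval], lo: int, hi: int) -> List[Interval]:
    """Intervals contained in the (closed, by assumption) window [lo, hi]."""
    return [(x, y) for (x, y) in intervals if lo <= x and y <= hi]


def leftmost(intervals: Iterable[Interval], lo: int, hi: int) -> Optional[Interval]:
    """Smallest right endpoint; ties broken by largest left endpoint."""
    inside = _inside(intervals, lo, hi)
    return min(inside, key=lambda iv: (iv[1], -iv[0])) if inside else None


def rightmost(intervals: Iterable[Interval], lo: int, hi: int) -> Optional[Interval]:
    """Largest left endpoint; ties broken by smallest right endpoint."""
    inside = _inside(intervals, lo, hi)
    return max(inside, key=lambda iv: (iv[0], -iv[1])) if inside else None
```

For the intervals `[(1, 5), (2, 4), (3, 8), (3, 6)]` in the window `[0, 10]`, Leftmost is `(2, 4)` and Rightmost is `(3, 6)`; their intersection `[3, 4]` is contained in every one of the four intervals.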
2.2 Sampling


We next describe a tool for sampling elements from a stream. A family of permutations
$\mathcal{H} = \{h : [n] \to [n]\}$ is $\varepsilon$-min-wise independent if

$$\forall X \subset [n] \text{ and } \forall y \in X:\quad \frac{1-\varepsilon}{|X|} \;\le\; \Pr_{h\in\mathcal{H}}\bigl[h(y) = \min h(X)\bigr] \;\le\; \frac{1+\varepsilon}{|X|}.$$

Here, $h \in \mathcal{H}$ is chosen uniformly at random. The family of all permutations is 0-min-wise
independent. However, there is no compact way to specify an arbitrary permutation. As
discussed by Broder, Charikar and Mitzenmacher, the results of Indyk [10] can be used
to construct a compact, computable family of permutations that is $\varepsilon$-min-wise independent.
See [i|| 5[5] for other uses of e-min-wise independent permutations. 


Lemma 1. For every $\varepsilon \in (0,1/2)$ and $n > 0$ there exists a family of permutations
$\mathcal{H}(n,\varepsilon) = \{h : [n] \to [n]\}$ with the following properties: (i) $\mathcal{H}(n,\varepsilon)$ has
$n^{O(\log(1/\varepsilon))}$ permutations; (ii) $\mathcal{H}(n,\varepsilon)$ is $\varepsilon$-min-wise independent;
(iii) an element of $\mathcal{H}(n,\varepsilon)$ can be chosen uniformly at random in
$O(\log(1/\varepsilon))$ time; (iv) for $h \in \mathcal{H}(n,\varepsilon)$ and $x,y \in [n]$, we can decide with $O(\log(1/\varepsilon))$ arithmetic
operations whether $h(x) < h(y)$.


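Proof. Indyk showed that there exist constants $c_1, c_2 > 1$ such that, for any $\varepsilon > 0$ and any
family $\mathcal{H}'$ of $c_2\log(1/\varepsilon)$-wise independent hash functions $[m] \to [m]$, the following holds:

$$\forall X \subset [m] \text{ with } |X| \le \frac{\varepsilon m}{c_1} \text{ and } \forall y \in [m]\setminus X:\quad
\frac{1-\varepsilon}{|X|+1} \;\le\; \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) < \min h'(X)\bigr] \;\le\; \frac{1+\varepsilon}{|X|+1}.$$

Set $m = c_1 n/\varepsilon > n$ and let $\mathcal{H}' = \{h' : [m] \to [m]\}$ be a family of $c_2\log(1/\varepsilon)$-wise independent
hash functions. Since $n = \varepsilon m/c_1$, the result of Indyk implies that

$$\forall X \subset [n] \text{ and } \forall y \in [n]\setminus X:\quad
\frac{1-\varepsilon}{|X|+1} \;\le\; \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) < \min h'(X)\bigr] \;\le\; \frac{1+\varepsilon}{|X|+1}.$$

Each hash function $h' \in \mathcal{H}'$ can be used to create a permutation $\hat h' : [n] \to [n]$: define $\hat h'(i)$ as
the position of $(h'(i), i)$ in the lexicographic order of $\{(h'(i), i) \mid i \in [n]\}$. Consider the set of
permutations $\hat{\mathcal{H}}' = \{\hat h' : [n] \to [n] \mid h' \in \mathcal{H}'\}$. For each $X \subset [n]$ and $y \in [n]\setminus X$ we have

$$\frac{1-\varepsilon}{|X|+1} \;\le\; \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) < \min h'(X)\bigr] \;\le\; \Pr_{\hat h'\in\hat{\mathcal{H}}'}\bigl[\hat h'(y) < \min \hat h'(X)\bigr]$$

and

$$\begin{aligned}
\Pr_{\hat h'\in\hat{\mathcal{H}}'}\bigl[\hat h'(y) < \min \hat h'(X)\bigr]
&\le \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) \le \min h'(X)\bigr] \\
&\le \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) < \min h'(X)\bigr] + \Pr_{h'\in\mathcal{H}'}\bigl[h'(y) = \min h'(X)\bigr] \\
&\le \frac{1+\varepsilon}{|X|+1} + \frac{1}{m} \\
&\le \frac{1+2\varepsilon}{|X|+1},
\end{aligned}$$

where we have used that $h'(y) = \min h'(X)$ corresponds to a collision and $m > n/\varepsilon$. We can
rewrite this as

$$\forall X \subset [n] \text{ and } \forall y \in X:\quad
\frac{1-\varepsilon}{|X|} \;\le\; \Pr_{\hat h'\in\hat{\mathcal{H}}'}\bigl[\hat h'(y) = \min \hat h'(X)\bigr] \;\le\; \frac{1+2\varepsilon}{|X|}.$$

Using $\varepsilon/2$ instead of $\varepsilon$ in the discussion, the lower bound becomes $(1-\varepsilon/2)/|X| \ge (1-\varepsilon)/|X|$
and the upper bound becomes $(1+\varepsilon)/|X|$, as desired. Standard constructions using polynomials
over finite fields can be used to construct a family $\mathcal{H}' = \{h' : [m] \to [m]\}$ of $c_2\log(1/\varepsilon)$-wise
independent hash functions such that: $\mathcal{H}'$ has $n^{O(\log(1/\varepsilon))}$ hash functions; an element of $\mathcal{H}'$ can
be chosen uniformly at random in $O(\log(1/\varepsilon))$ time; for $h' \in \mathcal{H}'$ and $x \in [n]$ we can compute
$h'(x)$ using $O(\log(1/\varepsilon))$ arithmetic operations.

This gives an implicit description of our desired set of permutations $\hat{\mathcal{H}}'$ satisfying (i)-(iii).
Moreover, while computing $\hat h'(x)$ for $\hat h' \in \hat{\mathcal{H}}'$ is demanding, we can easily decide whether
$\hat h'(x) < \hat h'(y)$ by computing and comparing $(h'(x), x)$ and $(h'(y), y)$. □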


Let us explain now how to use Lemma 1 to make a (nearly-uniform) random sample. We
learned this idea from Datar and Muthukrishnan. Consider any fixed subset $X \subset [n]$ and
let $\mathcal{H} = \mathcal{H}(n,\varepsilon)$ be the family of permutations given in Lemma 1. An $\mathcal{H}$-random element
$s$ of $X$ is obtained by choosing a hash function $h \in \mathcal{H}$ uniformly at random, and setting
$s = \arg\min\{h(t) \mid t \in X\}$. It is important to note that $s$ is not chosen uniformly at random
from $X$. However, from the definition of $\varepsilon$-min-wise independence we have

$$\forall x \in X:\quad \frac{1-\varepsilon}{|X|} \;\le\; \Pr[s = x] \;\le\; \frac{1+\varepsilon}{|X|}.$$

In particular, we obtain the following:

$$\forall Y \subset X:\quad \frac{(1-\varepsilon)\,|Y|}{|X|} \;\le\; \Pr[s \in Y] \;\le\; \frac{(1+\varepsilon)\,|Y|}{|X|}.$$

This means that, for a fixed $Y$, we can estimate the ratio $|Y|/|X|$ using $\mathcal{H}$-random samples from
$X$ repeatedly, and counting how many belong to $Y$.

Using $\mathcal{H}$-random samples has two advantages for data streams with elements from $[n]$.
Through the stream, we can maintain an $\mathcal{H}$-random sample $s$ of the elements seen so far. For
this, we select $h \in \mathcal{H}$ uniformly at random, and, for each new element $a$ of the stream, we check
whether $h(a) < h(s)$ and update $s$, if needed. An important feature of sampling in this way is
that $s$ is almost uniformly distributed among the elements appearing in the stream, without counting
multiplicities. The other important feature is that we select $s$ at its first appearance in the
stream. Thus, we can carry out any computation that depends on $s$ and on the portion of the
stream after its first appearance. For example, we can count how many times the $\mathcal{H}$-random
element $s$ appears in the whole data stream.
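
The following sketch illustrates how such a sample can be maintained in one pass, together with the count of its occurrences. It is only an illustration: the hash function is a random linear map modulo a prime, which is a stand-in of our choosing and is not the $\varepsilon$-min-wise independent family of Lemma 1; the class and attribute names are hypothetical. Ties are broken by comparing the pairs (hash value, element), as in the proof of Lemma 1.

```python
import random
from typing import Optional, Tuple


class MinHashSample:
    """Keeps the stream element with the smallest hash value, and its count."""

    P = (1 << 61) - 1  # a Mersenne prime, assumed larger than every element

    def __init__(self, seed: Optional[int] = None) -> None:
        rng = random.Random(seed)
        self.a = rng.randrange(1, self.P)  # stand-in hash: x -> (a*x + b) mod P
        self.b = rng.randrange(0, self.P)
        self.sample: Optional[int] = None
        self.count = 0

    def _key(self, x: int) -> Tuple[int, int]:
        # Comparing (h(x), x) pairs makes the induced order a total order.
        return ((self.a * x + self.b) % self.P, x)

    def process(self, x: int) -> None:
        if self.sample is None or self._key(x) < self._key(self.sample):
            # x becomes the sample at its first appearance in the stream.
            self.sample, self.count = x, 1
        elif x == self.sample:
            self.count += 1
```

Because the final sample wins the comparison the first time it appears, the counter sees every one of its occurrences, and repeated elements do not change which element is sampled.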

We will also use $\mathcal{H}$ to make conditional sampling: we select $\mathcal{H}$-random samples until we get
one satisfying a certain property. To analyze such a technique, the following result will be useful.


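Lemma 2. Let $Y \subset X \subset [n]$ and assume that $0 < \varepsilon < 1/2$. Consider the family of permutations
$\mathcal{H} = \mathcal{H}(n,\varepsilon)$ from Lemma 1 and take an $\mathcal{H}$-random sample $s$ from $X$. Then

$$\forall y \in Y:\quad \frac{1-4\varepsilon}{|Y|} \;\le\; \Pr[s = y \mid s \in Y] \;\le\; \frac{1+4\varepsilon}{|Y|}.$$

Proof. Consider any $y \in Y$. Since $s = y$ implies $s \in Y$, we have

$$\Pr[s = y \mid s \in Y] \;=\; \frac{\Pr[s = y]}{\Pr[s \in Y]} \;\le\; \frac{\dfrac{1+\varepsilon}{|X|}}{\dfrac{(1-\varepsilon)\,|Y|}{|X|}}
\;=\; \frac{1+\varepsilon}{1-\varepsilon}\cdot\frac{1}{|Y|} \;\le\; (1+4\varepsilon)\,\frac{1}{|Y|},$$

where in the last inequality we used that $\varepsilon < 1/2$. Similarly, we have

$$\Pr[s = y \mid s \in Y] \;\ge\; \frac{\dfrac{1-\varepsilon}{|X|}}{\dfrac{(1+\varepsilon)\,|Y|}{|X|}}
\;=\; \frac{1-\varepsilon}{1+\varepsilon}\cdot\frac{1}{|Y|} \;\ge\; (1-4\varepsilon)\,\frac{1}{|Y|},$$

which completes the proof. □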
Figure 1: At the bottom there is a partition of the real line. Filled-in disks indicate that the
endpoint is part of the interval; empty disks indicate that the endpoint is not part of the interval.
At the top, we show the split of some optimal solution $\mathbb{J}^*$ (dotted blue) into $\mathbb{J}^*_{\subset}$ and $\mathbb{J}^*_{\cap}$.

3 Largest independent subset of intervals 

In this section we show how to obtain a 2-approximation to the largest independent subset of $\mathbb{I}$
using $O(\alpha(\mathbb{I}))$ space.

A set $\mathbb{W}$ of windows is a partition of the real line if the windows in $\mathbb{W}$ are pairwise
disjoint and their union is the whole $\mathbb{R}$. The windows in $\mathbb{W}$ may be of different types regarding
the inclusion of endpoints. See Figure 1 for an example.

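Lemma 3. Let $\mathbb{I}$ be a set of intervals and let $\mathbb{W}$ be a partition of the real line with the following
properties:

• Each window of $\mathbb{W}$ contains at least one interval from $\mathbb{I}$.

• For each window $W \in \mathbb{W}$, the intervals of $\mathbb{I}$ contained in $W$ pairwise intersect.

Let $\mathbb{J}$ be any set of intervals constructed by selecting for each window $W$ of $\mathbb{W}$ an interval of $\mathbb{I}$
contained in $W$. Then $|\mathbb{J}| \ge \alpha(\mathbb{I})/2$.

Proof. Let us set $k = |\mathbb{W}|$. Consider a largest independent set of intervals $\mathbb{J}^* \subset \mathbb{I}$. We have
$|\mathbb{J}^*| = \alpha(\mathbb{I})$. Split the set $\mathbb{J}^*$ into two sets $\mathbb{J}^*_{\subset}$ and $\mathbb{J}^*_{\cap}$ as follows: $\mathbb{J}^*_{\subset}$ contains the intervals
contained in some window of $\mathbb{W}$ and $\mathbb{J}^*_{\cap}$ contains the other intervals. See Figure 1 for an example.
The intervals in $\mathbb{J}^*_{\cap}$ are pairwise disjoint and each of them intersects at least two consecutive
windows from $\mathbb{W}$, thus $|\mathbb{J}^*_{\cap}| \le k-1$. Since all intervals contained in a window of $\mathbb{W}$ pairwise
intersect, $\mathbb{J}^*_{\subset}$ has at most one interval per window. Thus $|\mathbb{J}^*_{\subset}| \le k$. Putting things together we
have

$$\alpha(\mathbb{I}) = |\mathbb{J}^*| = |\mathbb{J}^*_{\cap}| + |\mathbb{J}^*_{\subset}| \le k - 1 + k = 2k - 1.$$

Since $\mathbb{J}$ contains exactly $|\mathbb{W}| = k$ intervals, we obtain

$$2\cdot|\mathbb{J}| = 2k > 2k - 1 \ge \alpha(\mathbb{I}),$$

which implies the result. □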

We now discuss the algorithm. Through the processing of the stream, we maintain a partition
$\mathbb{W}$ of the line so that $\mathbb{W}$ satisfies the hypothesis of Lemma 3. To carry this out, for each
window $W$ of $\mathbb{W}$ we store the intervals Leftmost($W$) and Rightmost($W$). See Figure 2 for
an example. To initialize the structures, we start with a unique window $\mathbb{W} = \{\mathbb{R}\}$ and set
Leftmost($\mathbb{R}$) = Rightmost($\mathbb{R}$) = $I_0$, where $I_0$ is the first interval of the stream. With such
initialization, the hypotheses of Lemma 3 hold and Leftmost() and Rightmost() have the correct
values.
[Figure 2 legend: input intervals; partition of $\mathbb{R}$; Leftmost(); Rightmost(); Leftmost() $\cap$ Rightmost().]

Figure 2: Data maintained by the algorithm.


With a few local operations, we can handle the insertion of a new interval in $\mathbb{I}$. Consider
a new interval $I$ of the stream. If $I$ is not contained in any window of $\mathbb{W}$, we do not need
to do anything. If $I$ is contained in a window $W$, we check whether it intersects all intervals
contained in $W$. If $I$ intersects all intervals contained in $W$, we may have to update Leftmost($W$)
and Rightmost($W$). If $I$ does not intersect all the intervals contained in $W$, then $I$ is disjoint
from Leftmost($W$) $\cap$ Rightmost($W$). In such a case we can use one endpoint of Leftmost($W$) $\cap$
Rightmost($W$) to split the window $W$ into two windows $W_1$ and $W_2$, one containing $I$ and the
other containing either Leftmost($W$) or Rightmost($W$), so that the assumptions of Lemma 3
are restored. We also have enough information to obtain Leftmost() and Rightmost() for the
new windows $W_1$ and $W_2$. Figures 4 and 5 show two possible scenarios. See the pseudocode in
Figure 3 for a more detailed description.

Lemma 4. The policy described in Figure 3 maintains a set of windows $\mathbb{W}$ and a set of intervals
$\mathbb{J}$ that satisfy the assumptions of Lemma 3.

Proof. A simple case analysis shows that the policy maintains the assumptions of Lemma 3 and
the properties of Leftmost() and Rightmost().

Consider, for example, the case when the new interval $[x,y]$ is contained in a window
$W \in \mathbb{W}$ and $[\ell,r]$ = Leftmost($W$) $\cap$ Rightmost($W$) is to the left of $[x,y]$. In this case, the
algorithm will update the structures in lines 11-16 and lines 24-27. See Figure 5 for an
example. By the inductive hypothesis, all the intervals in $\mathbb{I}\setminus\{[x,y]\}$ contained in $W$ contain
$[\ell,r]$. Note that $W_1 = W \cap (-\infty, r]$, and thus only the intervals contained in $W$ with right
endpoint $r$ are contained in $W_1$. By the inductive hypothesis, Leftmost($W$) has right endpoint
$r$ and has largest left endpoint among all intervals contained in $W_1$. Thus, when we set
Rightmost($W_1$) = Leftmost($W_1$) = Leftmost($W$), the correct values for $W_1$ are set. As for
$W_2 = W \cap (r, +\infty)$, no interval of $\mathbb{I}\setminus\{[x,y]\}$ is contained in $W_2$, thus $[x,y]$ is the only interval
contained in $W_2$ and setting Rightmost($W_2$) = Leftmost($W_2$) = $[x,y]$ we get the correct values
for $W_2$. Lines 24-27 take care of replacing $W$ in $\mathbb{W}$ by $W_1$ and $W_2$. For $W_1$ and $W_2$ we set the
correct values of Leftmost() and Rightmost() and the assumptions of Lemma 3 hold. For the
other windows of $\mathbb{W}\setminus\{W\}$ nothing is changed. □

We can store the partition of the real line $\mathbb{W}$ using a dynamic binary search tree. With this,
line 1 and lines 24-25 take $O(\log|\mathbb{W}|) = O(\log\alpha(\mathbb{I}))$ time. The remaining steps take constant
time. The space required by the data structure is $O(|\mathbb{W}|) = O(\alpha(\mathbb{I}))$. This shows the following
result.

Theorem 5. Let $\mathbb{I}$ be a set of intervals in the real line that arrive in a data stream. There is a
data stream algorithm to compute a 2-approximation to the largest independent subset of $\mathbb{I}$ that
uses $O(\alpha(\mathbb{I}))$ space and handles each interval of the stream in $O(\log\alpha(\mathbb{I}))$ time.
Process interval $I = [x, y]$

 1. find the window $W$ of $\mathbb{W}$ that contains $x$
 2. $[\ell, r] \leftarrow$ Leftmost($W$) $\cap$ Rightmost($W$)
 3. if $y \in W$ then
 4.     if $[\ell, r] \cap [x, y] \neq \emptyset$ then
 5.         if $\ell < x$ or ($\ell = x$ and $[x,y] \subset$ Rightmost($W$)) then
 6.             Rightmost($W$) $\leftarrow [x,y]$
 7.         if $y < r$ or ($y = r$ and $[x,y] \subset$ Leftmost($W$)) then
 8.             Leftmost($W$) $\leftarrow [x,y]$
 9.     else (* $[\ell,r]$ and $[x,y]$ are disjoint; split $W$ *)
10.         if $x > r$ then (* $[\ell,r]$ to the left of $[x,y]$ *)
11.             make new windows $W_1 = W \cap (-\infty, r]$ and $W_2 = W \cap (r, +\infty)$
12.             Leftmost($W_1$) $\leftarrow$ Leftmost($W$)
13.             Rightmost($W_1$) $\leftarrow$ Leftmost($W$)
14.             $I' \leftarrow$ Leftmost($W$)
15.             Leftmost($W_2$) $\leftarrow [x,y]$
16.             Rightmost($W_2$) $\leftarrow [x,y]$
17.         else (* $y < \ell$, $[\ell,r]$ to the right of $[x,y]$ *)
18.             make new windows $W_1 = W \cap (-\infty, \ell)$ and $W_2 = W \cap [\ell, +\infty)$
19.             Leftmost($W_1$) $\leftarrow [x,y]$
20.             Rightmost($W_1$) $\leftarrow [x,y]$
21.             Leftmost($W_2$) $\leftarrow$ Rightmost($W$)
22.             Rightmost($W_2$) $\leftarrow$ Rightmost($W$)
23.             $I' \leftarrow$ Rightmost($W$)
24.         remove $W$ from $\mathbb{W}$
25.         add $W_1$ and $W_2$ to $\mathbb{W}$
26.         remove from $\mathbb{J}$ the interval that is contained in $W$
27.         add to $\mathbb{J}$ the intervals $[x,y]$ and $I'$
28. (* If $y \notin W$ then $[x,y]$ is not contained in any window *)


Figure 3: Policy to process a new interval $[x,y]$. $\mathbb{W}$ maintains a partition of the real line and $\mathbb{J}$
maintains a 2-approximation to $\alpha(\mathbb{I})$.
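
The policy of Figure 3 can be turned into a short runnable sketch. The version below makes several simplifying assumptions that are ours and not the paper's: endpoints are integers, the partition is kept in a sorted Python list searched with `bisect` (insertion costs linear time, so a balanced search tree would be needed for the time bound of Theorem 5), and the reported solution takes Leftmost($W$) from every window instead of maintaining $\mathbb{J}$ explicitly, which Lemma 3 allows. All names are illustrative.

```python
import bisect
import math
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval [x, y] with x <= y


class IntervalSelector:
    """One-pass 2-approximation following the case analysis of Figure 3."""

    def __init__(self) -> None:
        # Window i begins at starts[i] and ends where window i+1 begins.
        # A start (v, 0) is closed at v ("[v"); a start (v, 1) is open ("(v").
        self.starts: List[Tuple[float, int]] = []
        self.left: List[Interval] = []   # Leftmost(W) of each window
        self.right: List[Interval] = []  # Rightmost(W) of each window

    def _window_of(self, p: float) -> int:
        return bisect.bisect_right(self.starts, (p, 0)) - 1

    def _insert(self, i: int, start: Tuple[float, int], iv: Interval) -> None:
        self.starts.insert(i, start)
        self.left.insert(i, iv)
        self.right.insert(i, iv)

    def process(self, x: int, y: int) -> None:
        if not self.starts:  # first interval: a single window, the whole line
            self._insert(0, (-math.inf, 0), (x, y))
            return
        w = self._window_of(x)                        # line 1
        if self._window_of(y) != w:                   # line 28: ignore [x, y]
            return
        l, r = self.right[w][0], self.left[w][1]      # line 2
        if x <= r and l <= y:                         # line 4: they intersect
            if (x, -y) > (l, -self.right[w][1]):      # lines 5-6
                self.right[w] = (x, y)
            if (y, -x) < (r, -self.left[w][0]):       # lines 7-8
                self.left[w] = (x, y)
        elif x > r:                                   # lines 10-16
            self.right[w] = self.left[w]              # W1 = W ∩ (-inf, r]
            self._insert(w + 1, (r, 1), (x, y))       # W2 = W ∩ (r, +inf)
        else:                                         # lines 17-23, y < l
            old_right = self.right[w]
            self.left[w] = self.right[w] = (x, y)     # W1 = W ∩ (-inf, l)
            self._insert(w + 1, (l, 0), old_right)    # W2 = W ∩ [l, +inf)

    def solution(self) -> List[Interval]:
        """One interval per window; independent because windows are disjoint."""
        return list(self.left)
```

Feeding the stream `(1, 10), (2, 3), (5, 6), (8, 9), (4, 7)` produces three windows with solution `[(2, 3), (5, 6), (8, 9)]`; the last interval straddles two windows and is ignored, as in line 28.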

[Figure 4 legend: new interval; input intervals contained in $W$; partition of $\mathbb{R}$; Leftmost(); Rightmost().]

Figure 4: Example handled by lines 5-6 of the algorithm. The endpoints represented by crosses
may be in the window or not.
[Figure 5 legend: new interval; input intervals contained in $W$; partition of $\mathbb{R}$; Leftmost(); Rightmost().]

Figure 5: Example handled by lines 11-16 of the algorithm. The endpoints represented by
crosses may be in the window or not.


4 Size of largest independent set of intervals 

In this section we show how to obtain a randomized estimate of the value $\alpha(\mathbb{I})$. We will assume
that the endpoints of the intervals are in $[n]$.

Using the approach of Knuth [13], the algorithm presented in Section 3 can be used to define
an estimator whose expected value lies between $\alpha(\mathbb{I})/2$ and $\alpha(\mathbb{I})$. However, it has large variance
and we cannot use it to obtain an estimate of $\alpha(\mathbb{I})$ with good guarantees. The precise idea is as
follows.

The windows appearing through the algorithm of Section 3 naturally define a rooted binary
tree $T$, where each node represents a window. At the root of $T$ we have the whole real line.
Whenever a window $W$ is split into two windows $W'$ and $W''$, in $T$ we have nodes for $W'$ and $W''$
with parent $W$. The size of the output is the number of windows in the final partition, which is
exactly the number of leaves in $T$. Knuth [13] shows how to obtain an unbiased estimator of the
number of leaves of a tree. This estimator is obtained by choosing random root-to-leaf paths. (At
each node, one can use different rules to select how the random path continues.) Unfortunately,
the estimator has very large variance and cannot be used to obtain good guarantees. Easy
modifications of the method do not seem to work, so we develop a different method.

Our idea is to carefully split the window $[1, n]$ into segments, and compute for each segment
a 2-approximation. If each segment contains enough disjoint intervals from the input, then we
do not incur much error when combining the results of the segments. We then have to estimate the
number of segments in the partition of $[1, n]$ and the number of independent intervals in each
segment. First we describe the ingredients, independent of the streaming model, and discuss
their properties. Then we discuss how those ingredients can be computed in the data streaming
model.


4.1 Segments and their associated information 

Let $T$ be a balanced segment tree on the $n$ segments $[i, i+1)$, $i \in [n]$. Each leaf of $T$ corresponds
to a segment $[i, i+1)$ and the order of the leaves in $T$ agrees with the order of their corresponding
intervals along the real line. Each node $v$ of $T$ has an associated segment, denoted by $S(v)$, that
is the union of all segments stored at its descendants. It is easy to see that, for any internal
node $v$ with children $v_\ell$ and $v_r$, the segment $S(v)$ is the disjoint union of $S(v_\ell)$ and $S(v_r)$. See
Figure 6 for an example. We denote the root of $T$ by $r$. We have $S(r) = [1, n+1)$.

Figure 6: Segment tree for $n = 16$.


Let $\mathbb{S}$ be the set of segments associated with all nodes of $T$. Note that $\mathbb{S}$ has $2n - 1$ elements.
Each segment $S \in \mathbb{S}$ contains the left endpoint and does not contain the right endpoint.

For any segment $S \in \mathbb{S}$, where $S \neq S(r)$, let $\pi(S)$ be the “parent” segment of $S$: this is the
segment stored at the parent of $v$, where $S(v) = S$.
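
As an illustration, the segments and the parent map $\pi$ can be generated implicitly, without storing the tree. The sketch below assumes that every non-leaf segment $[lo, hi)$ is split at the integer midpoint, which is one possible way to obtain a balanced tree and coincides with Figure 6 when $n$ is a power of two; the function names are ours.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # half-open segment [lo, hi)


def root(n: int) -> Segment:
    return (1, n + 1)


def children(seg: Segment) -> Tuple[Segment, ...]:
    """The two halves of a segment, or () for a leaf [i, i+1)."""
    lo, hi = seg
    if hi - lo <= 1:
        return ()
    mid = (lo + hi) // 2
    return ((lo, mid), (mid, hi))


def all_segments(n: int) -> List[Segment]:
    """All 2n - 1 segments of the tree, found by a traversal from the root."""
    out, stack = [], [root(n)]
    while stack:
        seg = stack.pop()
        out.append(seg)
        stack.extend(children(seg))
    return out


def parent(seg: Segment, n: int) -> Optional[Segment]:
    """pi(seg): walk down from the root until seg appears as a child."""
    cur = root(n)
    if seg == cur:
        return None
    while True:
        for child in children(cur):
            if child == seg:
                return cur
            if child[0] <= seg[0] and seg[1] <= child[1]:
                cur = child
                break
        else:
            raise ValueError("not a segment of the tree")
```

For $n = 16$ this yields 31 segments, and for example `parent((1, 2), 16)` is `(1, 3)`.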

For any $S \in \mathbb{S}$, let $\beta(S)$ be the size of the largest independent subset of $\{I \in \mathbb{I} \mid I \subset S\}$.
That is, we consider the restriction of the problem to intervals of $\mathbb{I}$ contained in $S$. Similarly,
let $\hat\beta(S)$ be the size of a feasible solution computed for $\{I \in \mathbb{I} \mid I \subset S\}$ by the 2-approximation
algorithm described in Section 3 or by the algorithm of Emek, Halldorsson and Rosen. We
thus have $\beta(S) \ge \hat\beta(S) \ge \beta(S)/2$ for all $S \in \mathbb{S}$.

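Lemma 6. Let $\mathbb{S}' \subset \mathbb{S}$ be a set of segments with the following properties:

(i) $S(r)$ is the disjoint union of the segments in $\mathbb{S}'$, and,

(ii) for each $S \in \mathbb{S}'$, we have $\beta(\pi(S)) \ge 2\varepsilon^{-1}\lceil\log n\rceil$.

Then,

$$\alpha(\mathbb{I}) \;\ge\; \sum_{S\in\mathbb{S}'} \hat\beta(S) \;\ge\; \Bigl(\tfrac{1}{2} - \varepsilon\Bigr)\,\alpha(\mathbb{I}).$$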

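Proof. Since the segments in $\mathbb{S}'$ are disjoint because of hypothesis (i), we can merge the solutions
giving $\hat\beta(S)$ independent intervals, for all $S \in \mathbb{S}'$, to obtain a feasible solution for the whole $\mathbb{I}$.
We conclude that

$$\alpha(\mathbb{I}) \;\ge\; \sum_{S\in\mathbb{S}'} \beta(S) \;\ge\; \sum_{S\in\mathbb{S}'} \hat\beta(S).$$

This shows the first inequality.

Let $\tilde{\mathbb{S}}$ be the set of leafmost elements in the set of parents $\{\pi(S') \mid S' \in \mathbb{S}'\}$. Thus, each $\tilde S \in \tilde{\mathbb{S}}$
has some child in $\mathbb{S}'$ and no descendant in $\tilde{\mathbb{S}}$. For each $\tilde S \in \tilde{\mathbb{S}}$, let $\Pi_T(\tilde S)$ be the path in $T$ from
the root to $\tilde S$. By construction, for each $S \in \mathbb{S}'$ there exists some $\tilde S \in \tilde{\mathbb{S}}$ such that the parent of
$S$ is on $\Pi_T(\tilde S)$. By assumption (ii), for each $\tilde S \in \tilde{\mathbb{S}}$, we have $\beta(\tilde S) \ge 2\varepsilon^{-1}\lceil\log n\rceil$. Each $\tilde S \in \tilde{\mathbb{S}}$ is
going to “pay” for the error we make in the sum at the segments whose parents belong to $\Pi_T(\tilde S)$.

Let $\mathbb{J}^* \subset \mathbb{I}$ be an optimal solution to the interval selection problem. For each segment $S \in \mathbb{S}$,
$\mathbb{J}^*$ has at most 2 intervals that intersect $S$ but are not contained in $S$. Therefore, for all $S \in \mathbb{S}$
we have that

$$|\{J \in \mathbb{J}^* \mid J \cap S \neq \emptyset\}| \;\le\; |\{J \in \mathbb{J}^* \mid J \subset S\}| + 2 \;\le\; \beta(S) + 2. \qquad (1)$$

The segments in $\tilde{\mathbb{S}}$ are pairwise disjoint because in $T$ none is a descendant of the other.
This means that we can join solutions obtained inside the segments of $\tilde{\mathbb{S}}$ into a feasible solution.
Combining this with hypothesis (ii) we get

$$|\mathbb{J}^*| \;\ge\; \sum_{\tilde S\in\tilde{\mathbb{S}}} \beta(\tilde S) \;\ge\; |\tilde{\mathbb{S}}|\cdot 2\varepsilon^{-1}\lceil\log n\rceil. \qquad (2)$$

For each $\tilde S \in \tilde{\mathbb{S}}$, the path $\Pi_T(\tilde S)$ has at most $\lceil\log n\rceil$ vertices. Since each $S \in \mathbb{S}'$ has a parent
in $\Pi_T(\tilde S)$, for some $\tilde S \in \tilde{\mathbb{S}}$, we obtain from equation (2) that

$$|\mathbb{S}'| \;\le\; 2\lceil\log n\rceil\cdot|\tilde{\mathbb{S}}| \;\le\; 2\lceil\log n\rceil\cdot\frac{|\mathbb{J}^*|}{2\varepsilon^{-1}\lceil\log n\rceil} \;=\; \varepsilon\cdot|\mathbb{J}^*|. \qquad (3)$$

Using that $S(r)$ is the union of the segments in $\mathbb{S}'$ and equation (1) we obtain

$$\begin{aligned}
|\mathbb{J}^*| &\le \sum_{S\in\mathbb{S}'} |\{J \in \mathbb{J}^* \mid J \cap S \neq \emptyset\}| \\
&\le \sum_{S\in\mathbb{S}'} \bigl(\beta(S) + 2\bigr) \\
&= 2\cdot|\mathbb{S}'| + \sum_{S\in\mathbb{S}'} \beta(S) \\
&\le 2\varepsilon\cdot|\mathbb{J}^*| + \sum_{S\in\mathbb{S}'} \beta(S),
\end{aligned}$$

where in the last inequality we used equation (3). Now we use that

$$\forall S \in \mathbb{S}:\quad 2\cdot\hat\beta(S) \ge \beta(S)$$

to conclude that

$$|\mathbb{J}^*| \;\le\; 2\varepsilon\cdot|\mathbb{J}^*| + \sum_{S\in\mathbb{S}'} \beta(S) \;\le\; 2\varepsilon\cdot|\mathbb{J}^*| + \sum_{S\in\mathbb{S}'} 2\cdot\hat\beta(S).$$

The second inequality that we want to show follows because $|\mathbb{J}^*| = \alpha(\mathbb{I})$. □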

We would like to find a set $\mathbb{S}'$ satisfying the hypothesis of Lemma 6. However, the definition
should be local: to know whether a segment $S$ belongs to $\mathbb{S}'$ we should use only local information
around $S$. The estimator $\hat\beta(S)$ is not suitable. For example, it may happen that, for some
segment $S \in \mathbb{S}\setminus\{S(r)\}$, we have $\hat\beta(\pi(S)) < \hat\beta(S)$, which is counterintuitive and problematic. We
introduce another estimate that is an $O(\log n)$-approximation but is monotone nondecreasing
along paths to the root.

For each segment $S \in \mathbb{S}$ we define

$$\gamma(S) = |\{S' \in \mathbb{S} \mid S' \subset S \text{ and } \exists I \in \mathbb{I} \text{ s.t. } I \subset S'\}|.$$

Thus, $\gamma(S)$ is the number of segments of $\mathbb{S}$ that are contained in $S$ and contain some input
interval.
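
A small sketch may help to see what $\gamma(S)$ counts. It assumes closed input intervals with integer endpoints, so that $[x, y]$ is contained in the segment $[lo, hi)$ exactly when $lo \le x$ and $y < hi$, and it assumes segments are split at the integer midpoint; both the splitting rule and the function names are our own choices for illustration. For each interval it marks the segments on the path from $S$ down to the smallest segment containing the interval.

```python
from typing import Iterable, Set, Tuple

Segment = Tuple[int, int]   # half-open segment [lo, hi)
Interval = Tuple[int, int]  # closed interval [x, y]


def children(seg: Segment) -> Tuple[Segment, ...]:
    lo, hi = seg
    if hi - lo <= 1:
        return ()
    mid = (lo + hi) // 2
    return ((lo, mid), (mid, hi))


def contains(seg: Segment, iv: Interval) -> bool:
    return seg[0] <= iv[0] and iv[1] < seg[1]


def gamma(seg: Segment, intervals: Iterable[Interval]) -> int:
    """Number of tree segments inside `seg` that contain some input interval."""
    marked: Set[Segment] = set()
    for iv in intervals:
        cur = seg if contains(seg, iv) else None
        while cur is not None:
            marked.add(cur)
            # At most one child can contain iv, because the children are disjoint.
            cur = next((c for c in children(cur) if contains(c, iv)), None)
    return len(marked)
```

With $n = 16$ and the intervals `[(1, 2), (3, 3), (10, 12)]`, the root segment `(1, 17)` has $\gamma = 8$, while `(1, 9)` has $\gamma = 5$ and `(9, 17)` has $\gamma = 2$, so the value does not increase when moving from a segment to one of its children.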

Lemma 7. For all $S \in \mathbb{S}$, we have the following properties:

(i) $\gamma(S) \le \gamma(\pi(S))$, if $S \neq S(r)$,

(ii) $\gamma(S) \le \beta(S)\cdot\lceil\log n\rceil$,

(iii) $\gamma(S) \ge \beta(S)$, and

(iv) $\gamma(S)$ can be computed in $O(\gamma(S))$ space using the portion of the stream after the first
interval contained in $S$.

Proof. Property (i) is obvious from the definition because any $S'$ contained in $S$ is also contained
in the parent $\pi(S)$.

For the rest of the proof, fix some $S \in \mathbb{S}$ and define

$$\mathbb{S}' = \{S' \in \mathbb{S} \mid S' \subset S \text{ and } \exists I \in \mathbb{I} \text{ s.t. } I \subset S'\}.$$

Note that $\gamma(S)$ is the size of $\mathbb{S}'$. Let $T_S$ be the subtree of $T$ rooted at $S$.

For property (ii), note that $T_S$ has at most $\lceil\log n\rceil$ levels. By the pigeonhole principle, there
is some level $L$ of $T_S$ that contains at least $\gamma(S)/\lceil\log n\rceil$ different segments of $\mathbb{S}'$. The segments
of $\mathbb{S}'$ contained in level $L$ are disjoint, and each of them contains some interval of $\mathbb{I}$. Picking an
interval from each segment of $\mathbb{S}'$ in level $L$, we get a subset of intervals from $\mathbb{I}$ that are pairwise disjoint, and thus

$$\beta(S) \ge \gamma(S)/\lceil\log n\rceil.$$

For property (iii), consider an optimal solution $\mathbb{J}^*$ for the interval selection problem in
$\{I \in \mathbb{I} \mid I \subset S\}$. Thus $|\mathbb{J}^*| = \beta(S)$. For each interval $J \in \mathbb{J}^*$, let $S(J)$ be the smallest segment of $\mathbb{S}$ that
contains $J$. Then $S(J) \in \mathbb{S}'$. Note that $J$ contains the middle point of $S(J)$, as otherwise there
would be a smaller segment in $\mathbb{S}$ containing $J$. This implies that the segments $S(J)$, $J \in \mathbb{J}^*$, are
all distinct. (However, they are not necessarily disjoint.) We then have

$$\gamma(S) = |\mathbb{S}'| \ge |\{S(J) \mid J \in \mathbb{J}^*\}| = |\mathbb{J}^*| = \beta(S).$$

For property (iv), we store the elements of $\mathbb{S}'$ in a binary search tree. Whenever we obtain
an interval $I$, we check whether the segments contained in $S$ and containing $I$ are already in the
search tree and, if needed, update the structure. The space needed in a binary search tree is
proportional to the number of elements stored and thus we need $O(\gamma(S))$ space. □

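A segment $S$ of $\mathbb{S}$, $S \neq S(r)$, is relevant if

$$\gamma(\pi(S)) \ge 2\varepsilon^{-1}\lceil\log n\rceil^{2} \quad\text{and}\quad 1 \le \gamma(S) < 2\varepsilon^{-1}\lceil\log n\rceil^{2}.$$

Let $\mathbb{S}_{rel} \subset \mathbb{S}$ be the set of relevant segments. If $\mathbb{S}_{rel}$ is empty, then we take $\mathbb{S}_{rel} = \{S(r)\}$.

Because of Lemma 7(i), $\gamma(\cdot)$ is nonincreasing along any root-to-leaf path in $T$. Using Lemmas 6
and 7 we obtain the following.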

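Lemma 8. We have

$$\alpha(\mathbb{I}) \;\ge\; \sum_{S\in\mathbb{S}_{rel}} \hat\beta(S) \;\ge\; \Bigl(\tfrac{1}{2} - \varepsilon\Bigr)\,\alpha(\mathbb{I}).$$

Proof. If $\gamma(S(r)) < 2\varepsilon^{-1}\lceil\log n\rceil^{2}$, then $\mathbb{S}_{rel} = \{S(r)\}$ and the result is clear. Thus we can assume
that $\gamma(S(r)) \ge 2\varepsilon^{-1}\lceil\log n\rceil^{2}$, which implies that $S(r) \notin \mathbb{S}_{rel}$.
Define

$$\mathbb{S}_0 = \{S \in \mathbb{S}\setminus\{S(r)\} \mid \gamma(S) = 0 \text{ and } \gamma(\pi(S)) \ge 2\varepsilon^{-1}\lceil\log n\rceil^{2}\}.$$

First note that the segments of $\mathbb{S}_{rel} \cup \mathbb{S}_0$ form a disjoint union of $S(r)$. Indeed, for each
elementary segment $[i, i+1) \in \mathbb{S}$, there exists exactly one ancestor that is either relevant or in $\mathbb{S}_0$.
Lemma 7(ii), the definition of relevant segment, and the fact $\gamma(S(r)) \ge 2\varepsilon^{-1}\lceil\log n\rceil^{2}$ imply that

$$\forall S \in \mathbb{S}_{rel}\cup\mathbb{S}_0:\quad \beta(\pi(S)) \;\ge\; \gamma(\pi(S))/\lceil\log n\rceil \;\ge\; 2\varepsilon^{-1}\lceil\log n\rceil.$$

Therefore, the set $\mathbb{S}' = \mathbb{S}_{rel} \cup \mathbb{S}_0$ satisfies the conditions of Lemma 6. Using that for all $S \in \mathbb{S}_0$
we have $\gamma(S) = \hat\beta(S) = 0$, we obtain the claimed inequalities. □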
Figure 7: Active segments because of an interval $I$.


Let $N_{rel}$ be the number of relevant segments. A segment $S \in \mathbb{S}$ is active if $S = S(r)$ or its parent contains some input interval. See Figure 7 for an example. Let $N_{act}$ be the number of active segments in $\mathbb{S}$. We are going to estimate $N_{act}$, the ratio $N_{rel}/N_{act}$, and the average value of $\beta(S)$ over the relevant segments $S \in \mathbb{S}_{rel}$. With this, we will be able to estimate the sum considered in Lemma 8. The next section describes how the estimations are obtained in the data streaming model.


4.2 Algorithms in the streaming model 

For each interval $I$, we use $\sigma_{\mathbb{S}}(I)$ for the sequence of segments from $\mathbb{S}$ that are active because of interval $I$, ordered non-increasingly by size. Thus, $\sigma_{\mathbb{S}}(I)$ contains $S(r)$ followed by the segments whose parents contain $I$. The selected ordering implies that a parent $\pi(S)$ appears before $S$, for all $S \ne S(r)$ in the sequence $\sigma_{\mathbb{S}}(I)$. Note that $\sigma_{\mathbb{S}}(I)$ has at most $2\lceil \log n \rceil$ elements because $T$ is balanced.
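
The sequence $\sigma_{\mathbb{S}}(I)$ can be produced by a single descent from the root. The following Python sketch illustrates this under assumptions that the text above does not fix: the root segment is taken to be $[1, n+1)$, a segment $[x, y)$ with $y - x \ge 2$ is split at $\lfloor (x+y)/2 \rfloor$, and $[x, y)$ is said to contain a closed interval $[a, b]$ when $x \le a$ and $b < y$. The function name `active_segments` is hypothetical.

```python
def active_segments(n, interval):
    """Segments that become active because of `interval`, largest first.

    Assumptions (illustrative only): the root is [1, n+1); a segment [x, y)
    is split at (x + y) // 2; [x, y) contains [a, b] iff x <= a and b < y.
    """
    a, b = interval
    node = (1, n + 1)
    out = [node]                          # the root S(r) is always active
    while node is not None:
        x, y = node
        if not (x <= a and b < y) or y - x < 2:
            break                         # node does not contain I, or is a leaf
        mid = (x + y) // 2
        children = [(x, mid), (mid, y)]
        # Both children are active because their parent contains I;
        # the larger child is listed first.
        out.extend(sorted(children, key=lambda s: s[0] - s[1]))
        # At most one child can still contain I; continue the descent there.
        node = next((c for c in children if c[0] <= a and b < c[1]), None)
    return out
```

Each level of the descent contributes two segments, so the length of the output is logarithmic in $n$ under these assumptions.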


Lemma 9. There is an algorithm in the data stream model that uses $O(\varepsilon^{-2} + \log n)$ space and computes a value $\hat{N}_{act}$ such that

$$\Pr\left[\, |N_{act} - \hat{N}_{act}| \le \varepsilon \cdot N_{act} \,\right] \ge \frac{11}{12}.$$


Proof. We estimate $N_{act}$ using, as a black box, known results to estimate the number of distinct elements in a data stream. The stream of intervals $\mathbb{I} = I_1, I_2, \ldots$ defines a stream of segments $\sigma = \sigma_{\mathbb{S}}(I_1), \sigma_{\mathbb{S}}(I_2), \ldots$ that is $O(\log n)$ times longer. The segments appearing in the stream $\sigma$ are precisely the active segments.

We have reduced the problem to the problem of how many distinct elements appear in a stream of segments from $\mathbb{S}$. The result of Kane, Nelson and Woodruff [12] for distinct elements uses $O(\varepsilon^{-2} + \log |\mathbb{S}|) = O(\varepsilon^{-2} + \log n)$ space and computes a value $\hat{N}_{act}$ such that

$$\Pr\left[\, (1-\varepsilon) N_{act} \le \hat{N}_{act} \le (1+\varepsilon) N_{act} \,\right] \ge \frac{11}{12}.$$

Note that, to process an interval of the stream $\mathbb{I}$, we have to process $O(\log n)$ segments of $\mathbb{S}$. □
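
The proof treats the distinct-elements algorithm of Kane, Nelson and Woodruff as a black box. Purely to illustrate the reduction, the sketch below swaps in a much simpler k-minimum-values estimator. It is not the algorithm of [12] and does not achieve its space bound; the class name, the parameter `k` and the salted BLAKE2 hash are assumptions made for this example.

```python
import hashlib
import heapq

class KMVDistinctCounter:
    """k-minimum-values estimator (a stand-in, not the algorithm of [12])."""

    def __init__(self, k=256, salt=b"demo"):
        self.k = k
        self.salt = salt
        self._heap = []        # negated values: max-heap of the k smallest hashes
        self._members = set()  # the same hash values, for duplicate detection

    def _hash01(self, item):
        digest = hashlib.blake2b(repr(item).encode(), key=self.salt,
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, item):
        v = self._hash01(item)
        if v in self._members:
            return                                  # repeated element
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -v)
            self._members.add(v)
        elif v < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -v)
            self._members.discard(evicted)
            self._members.add(v)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))           # exact while below k
        return (self.k - 1) / (-self._heap[0])
```

Feeding every segment of $\sigma_{\mathbb{S}}(I_1), \sigma_{\mathbb{S}}(I_2), \ldots$ to `add` and calling `estimate` at the end gives an approximation of $N_{act}$ in the spirit of the lemma.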

Lemma 10. There is an algorithm in the data stream model that uses $O(\varepsilon^{-4} \log^4 n)$ space and computes a value $\hat{N}_{rel}$ such that

$$\Pr\left[\, |N_{rel} - \hat{N}_{rel}| \le \varepsilon \cdot N_{rel} \,\right] \ge \frac{10}{12}.$$

Proof. The idea is the following. We estimate $N_{act}$ by $\hat{N}_{act}$ using Lemma 9. We take a sample of active segments, and count how many of them are relevant. To get a representative sample, it will be important to use a lower bound on $N_{rel}/N_{act}$. With this we can estimate $N_{rel} = (N_{rel}/N_{act}) \cdot N_{act}$ accurately. We next provide the details.

In $T$, each relevant segment $S' \in \mathbb{S}_{rel}$ has $2\gamma(S') < 4\varepsilon^{-1}\lceil \log n \rceil^2$ active segments below it and at most $2\lceil \log n \rceil$ active segments whose parent is an ancestor of $S'$. This means that for each relevant segment there are at most

$$4\varepsilon^{-1}\lceil \log n \rceil^2 + 2\lceil \log n \rceil \le 6\varepsilon^{-1}\lceil \log n \rceil^2$$

active segments. We obtain that

$$\frac{N_{rel}}{N_{act}} \ge \frac{1}{6\varepsilon^{-1}\lceil \log n \rceil^2} = \frac{\varepsilon}{6\lceil \log n \rceil^2}. \tag{4}$$



Fix any injective mapping $b$ between $\mathbb{S}$ and $[n^2]$ that can be easily computed. For example, for each segment $S = [x, y)$ we may take $b(S) = n(x-1) + (y-1)$. Consider a family $\mathcal{H} = \mathcal{H}(n^2, \varepsilon)$ of permutations $[n^2] \to [n^2]$ guaranteed by the earlier lemma on families of permutations. For each $h \in \mathcal{H}$, the function $h \circ b$ gives an order among the elements of $\mathbb{S}$. We use them to compute $\mathcal{H}$-random samples among the active segments.

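Set $k = \lceil 72\lceil \log n \rceil^2 / (\varepsilon^3(1-\varepsilon)) \rceil = O(\varepsilon^{-3} \log^2 n)$, and choose permutations $h_1, \ldots, h_k \in \mathcal{H}$ uniformly and independently at random. For each permutation $h_j$, where $j = 1, \ldots, k$, let $S_j$ be the active segment of $\mathbb{S}$ that minimizes $(h_j \circ b)(\cdot)$. Thus

$$S_j = \arg\min \{\, h_j(b(S)) \mid S \in \mathbb{S} \text{ is active} \,\}.$$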
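
As a rough illustration of how the samples $S_1, \ldots, S_k$ can be maintained over the stream of active segments, here is a Python sketch. It uses the mapping $b(S) = n(x-1) + (y-1)$ from the text, but replaces the permutation family $\mathcal{H}$ by salted BLAKE2 hashes, which is an assumption of this example and carries none of the guarantees of $\mathcal{H}$. The class name `MinHashSampler` is hypothetical.

```python
import hashlib

def b(segment, n):
    """Injective code of a segment [x, y) with 1 <= x < y <= n + 1."""
    x, y = segment
    return n * (x - 1) + (y - 1)

class MinHashSampler:
    """Keeps, for each j, the active segment minimizing a salted hash of b(S)."""

    def __init__(self, n, k, seed=0):
        self.n = n
        self.keys = [f"{seed}:{j}".encode() for j in range(k)]
        self.best = [None] * k            # (hash value, segment) for each j

    def _h(self, j, code):
        digest = hashlib.blake2b(code.to_bytes(16, "big"), key=self.keys[j],
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, segment):
        code = b(segment, self.n)
        for j in range(len(self.keys)):
            candidate = (self._h(j, code), segment)
            if self.best[j] is None or candidate < self.best[j]:
                self.best[j] = candidate

    def samples(self):
        return [entry[1] if entry else None for entry in self.best]
```

Because each sample depends only on the set of segments seen, the result is the same for any order of the stream and any number of repetitions.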




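The idea is that $S_j$ is nearly a random active segment of $\mathbb{S}$. Therefore, if we define the random variable

$$X = |\{\, j \in \{1, \ldots, k\} \mid S_j \text{ is relevant} \,\}|$$

then $N_{rel}/N_{act}$ is roughly $X/k$. Below we discuss the computation of $X$. To analyze the random variable $X$ more precisely, let us define

$$p = \Pr_{h_j \in \mathcal{H}}[\, S_j \text{ is relevant} \,].$$

Since $S_j$ is selected among the active segments, the discussion after the lemma on the family $\mathcal{H}$ implies

$$p \in \left[ \frac{(1-\varepsilon) N_{rel}}{N_{act}},\ \frac{(1+\varepsilon) N_{rel}}{N_{act}} \right]. \tag{5}$$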

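In particular, using the estimate (4) and the definition of $k$ we get

$$kp \ge \frac{72\lceil \log n \rceil^2}{\varepsilon^3(1-\varepsilon)} \cdot \frac{(1-\varepsilon) N_{rel}}{N_{act}} \ge \frac{72\lceil \log n \rceil^2}{\varepsilon^3} \cdot \frac{\varepsilon}{6\lceil \log n \rceil^2} = \frac{12}{\varepsilon^2}. \tag{6}$$
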


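Note that $X$ is the sum of $k$ independent random variables taking values in $\{0,1\}$ and $E[X] = kp$. It follows from Chebyshev's inequality and the lower bound in (6) that

$$\Pr\left[\, \left|\frac{X}{k} - p\right| \ge \varepsilon p \,\right] = \Pr\left[\, |X - kp| \ge \varepsilon k p \,\right] \le \frac{\mathrm{Var}[X]}{(\varepsilon k p)^2} = \frac{kp(1-p)}{(\varepsilon k p)^2} < \frac{1}{kp\varepsilon^2} \le \frac{1}{12}.$$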
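
The arithmetic behind the bound $1/12$ can be checked numerically. The Python sketch below evaluates $k$ and the worst-case Chebyshev bound $1/(kp\varepsilon^2)$ using the smallest value of $p$ allowed by (4) and (5). It assumes that $\log$ denotes the base-2 logarithm, and the function names are hypothetical.

```python
import math

def lemma10_sample_size(n, eps):
    """k = ceil(72 * ceil(log n)^2 / (eps^3 * (1 - eps))), with log base 2 assumed."""
    log_n = math.ceil(math.log2(n))
    return math.ceil(72 * log_n**2 / (eps**3 * (1 - eps)))

def lemma10_failure_bound(n, eps):
    """Upper bound 1 / (k * p * eps^2) at the smallest p allowed by (4) and (5)."""
    log_n = math.ceil(math.log2(n))
    k = lemma10_sample_size(n, eps)
    p_min = (1 - eps) * eps / (6 * log_n**2)
    return 1 / (k * p_min * eps**2)
```

For any $n \ge 2$ and $\varepsilon \in (0, 1/2)$ the returned bound should not exceed $1/12$, up to floating-point rounding.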
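To finalize, let us define the estimator $\hat{N}_{rel} = \hat{N}_{act} \cdot (X/k)$ of $N_{rel}$, where $\hat{N}_{act}$ is the estimator of $N_{act}$ given in Lemma 9. When the events

$$\left[\, |\hat{N}_{act} - N_{act}| \le \varepsilon N_{act} \,\right] \quad\text{and}\quad \left[\, \left|\frac{X}{k} - p\right| \le \varepsilon p \,\right]$$

occur, then we can use equation (5) and $\varepsilon < 1/2$ to see that

$$\hat{N}_{rel} \le (1+\varepsilon) N_{act} \cdot (1+\varepsilon) p \le (1+\varepsilon)^3 N_{act} \cdot \frac{N_{rel}}{N_{act}} = (1+\varepsilon)^3 N_{rel} \le (1+7\varepsilon) N_{rel},$$

and also

$$\hat{N}_{rel} \ge (1-\varepsilon) N_{act} \cdot (1-\varepsilon) p \ge (1-\varepsilon)^3 N_{act} \cdot \frac{N_{rel}}{N_{act}} = (1-\varepsilon)^3 N_{rel} \ge (1-7\varepsilon) N_{rel}.$$
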

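We conclude that

$$\Pr\left[\, (1-7\varepsilon) N_{rel} \le \hat{N}_{rel} \le (1+7\varepsilon) N_{rel} \,\right] \ge 1 - \Pr\left[\, |\hat{N}_{act} - N_{act}| > \varepsilon N_{act} \,\right] - \Pr\left[\, \left|\frac{X}{k} - p\right| > \varepsilon p \,\right] \ge 1 - \frac{1}{12} - \frac{1}{12} = \frac{10}{12}.$$

Replacing in the argument $\varepsilon$ by $\varepsilon/7$, we obtain the desired bound.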

It remains to discuss how $X$ can be computed. For each $j$, where $j = 1, \ldots, k$, we keep a variable that stores the current segment $S_j$ for all the segments that are active so far, keep information about the choice of $h_j$, and keep information about $\gamma(S_j)$ and $\gamma(\pi(S_j))$, so that we can decide whether $S_j$ is relevant.

Let $I_1, I_2, \ldots$ be the data stream of input intervals. We consider the stream of segments $\sigma = \sigma_{\mathbb{S}}(I_1), \sigma_{\mathbb{S}}(I_2), \ldots$ When handling a segment $S$ of the stream $\sigma$, we may have to update $S_j$; this happens when $h_j(b(S)) < h_j(b(S_j))$. Note that we can indeed maintain $\gamma(\pi(S_j))$ because $S_j$ becomes active the first time that its parent contains some input interval. This is also the first time when $\gamma(\pi(S_j))$ becomes nonzero, and thus the forthcoming part of the stream has enough information to compute $\gamma(S_j)$ and $\gamma(\pi(S_j))$. (Here it is convenient that $\sigma_{\mathbb{S}}(I)$ gives segments in decreasing size.) To maintain $\gamma(S_j)$ and $\gamma(\pi(S_j))$, we use Lemma 7(iv).

To reduce the space used by each index $j$, we use the following simple trick. If at some point we detect that $\gamma(S_j)$ is larger than $2\varepsilon^{-1}\lceil \log n \rceil^2$, we just store that $S_j$ is not relevant. If at some point we detect that $\gamma(\pi(S_j))$ is larger than $2\varepsilon^{-1}\lceil \log n \rceil^2$, we just store that $\pi(S_j)$ is large enough that $S_j$ could be relevant. We conclude that, for each $j$, we need at most $O(\log(1/\varepsilon) + \varepsilon^{-1}\log^2 n)$ space. Therefore, we need in total $O(k\varepsilon^{-1}\log^2 n) = O(\varepsilon^{-4}\log^4 n)$ space. □


Let

$$\rho = \Bigl(\sum_{S \in \mathbb{S}_{rel}} \beta(S)\Bigr) \Big/ |\mathbb{S}_{rel}|.$$

The next result shows how to estimate $\rho$.


Lemma 11. There is an algorithm in the data stream model that uses $O(\varepsilon^{-5} \log^6 n)$ space and computes a value $\hat{\rho}$ such that

$$\Pr\left[\, |\rho - \hat{\rho}| \le \varepsilon \rho \,\right] \ge \frac{10}{12}.$$
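Proof. Fix any injective mapping $b$ between $\mathbb{S}$ and $[n^2]$ that can be easily computed, and consider a family $\mathcal{H} = \mathcal{H}(n^2, \varepsilon)$ of permutations $[n^2] \to [n^2]$ guaranteed by the earlier lemma on families of permutations. For each $h \in \mathcal{H}$, the function $h \circ b$ gives an order among the elements of $\mathbb{S}$. We use them to compute $\mathcal{H}$-random samples among the active segments.

Let $\mathbb{S}_{act}$ be the set of active segments.

Consider a random variable $Y_1$ defined as follows. We repeatedly sample $h_1 \in \mathcal{H}$ uniformly at random, until we get that $\arg\min_{S \in \mathbb{S}_{act}} h_1(b(S))$ is a relevant segment. Let $S_1$ be the resulting relevant segment $\arg\min_{S \in \mathbb{S}_{act}} h_1(b(S))$, and set $Y_1 = \beta(S_1)$. Because of the earlier sampling lemma, applied with $X = \mathbb{S}_{act}$ and $Y = \mathbb{S}_{rel}$, we have

$$\forall S \in \mathbb{S}_{rel}: \quad \frac{1-4\varepsilon}{|\mathbb{S}_{rel}|} \le \Pr[S_1 = S] \le \frac{1+4\varepsilon}{|\mathbb{S}_{rel}|}.$$

We thus have

$$E[Y_1] = \sum_{S \in \mathbb{S}_{rel}} \Pr[S_1 = S] \cdot \beta(S) \le \sum_{S \in \mathbb{S}_{rel}} \frac{1+4\varepsilon}{|\mathbb{S}_{rel}|} \cdot \beta(S) = (1+4\varepsilon) \cdot \rho,$$

and similarly

$$E[Y_1] \ge \sum_{S \in \mathbb{S}_{rel}} \frac{1-4\varepsilon}{|\mathbb{S}_{rel}|} \cdot \beta(S) = (1-4\varepsilon) \cdot \rho.$$
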

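For the variance we can use $\beta(S) \le \gamma(S)$ and the definition of relevant segments to get

$$\mathrm{Var}[Y_1] \le E[Y_1^2] = \sum_{S \in \mathbb{S}_{rel}} \Pr[S_1 = S] \cdot (\beta(S))^2 \le \sum_{S \in \mathbb{S}_{rel}} \frac{1+4\varepsilon}{|\mathbb{S}_{rel}|} \cdot \beta(S) \cdot \gamma(S) < \frac{2(1+4\varepsilon)\lceil \log n \rceil^2}{\varepsilon} \cdot \rho \le \frac{6\lceil \log n \rceil^2}{\varepsilon} \cdot \rho.$$

Note also that $\gamma(S) \ge 1$ implies $\beta(S) \ge 1$. Therefore, $\rho \ge 1$.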

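Consider an integer $k$ to be chosen later. Let $Y_2, \ldots, Y_k$ be independent random variables with the same distribution as $Y_1$, and define $\hat{\rho} = (\sum_{i=1}^{k} Y_i)/k$. Using Chebyshev's inequality and $\rho \ge 1$ we obtain

$$\Pr\left[\, |\hat{\rho} - E[Y_1]| \ge \varepsilon\rho \,\right] = \Pr\left[\, |\hat{\rho}k - E[Y_1]k| \ge \varepsilon k \rho \,\right] \le \frac{\mathrm{Var}[\hat{\rho}k]}{(\varepsilon k \rho)^2} = \frac{k\,\mathrm{Var}[Y_1]}{(\varepsilon k \rho)^2} \le \frac{6\lceil \log n \rceil^2 \rho}{k\varepsilon^3\rho^2} \le \frac{6\lceil \log n \rceil^2}{k\varepsilon^3}.$$

Setting $k = 6 \cdot 12 \cdot \lceil \log n \rceil^2/\varepsilon^3$, we have

$$\Pr\left[\, |\hat{\rho} - E[Y_1]| \ge \varepsilon\rho \,\right] \le \frac{1}{12}.$$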
We then proceed similarly to the proof of Lemma 10. Set $k_0 = 12\lceil \log n \rceil^2 k/(\varepsilon(1-\varepsilon)) = O(\varepsilon^{-4}\log^4 n)$. For each $j \in [k_0]$, take a function $h_j \in \mathcal{H}$ uniformly at random and select $S_j = \arg\min\{h_j(b(S)) \mid S \text{ is active}\}$. Let $X$ be the number of relevant segments in $S_1, \ldots, S_{k_0}$ and let $p = \Pr[S_1 \in \mathbb{S}_{rel}]$. Using the analysis of Lemma 10 we have

$$k_0 p \ge \frac{12\lceil \log n \rceil^2 k}{\varepsilon(1-\varepsilon)} \cdot \frac{(1-\varepsilon) N_{rel}}{N_{act}} \ge \frac{12\lceil \log n \rceil^2 k}{\varepsilon(1-\varepsilon)} \cdot \frac{(1-\varepsilon)\varepsilon}{6\lceil \log n \rceil^2} = 2k$$

and

$$\Pr\left[\, |X - k_0 p| \ge k_0 p/2 \,\right] \le \frac{\mathrm{Var}[X]}{(k_0 p/2)^2} = \frac{4 k_0 p(1-p)}{k_0^2 p^2} < \frac{4}{k_0 p} \le \frac{4}{2k} \le \frac{1}{12}.$$

This means that, with probability at least $11/12$, the sample $S_1, \ldots, S_{k_0}$ contains at least $(1/2) k_0 p \ge k$ relevant segments. We can then use the first $k$ of those relevant segments to compute the estimate $\hat{\rho}$.

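With probability at least $1 - 1/12 - 1/12 = 10/12$ we have the events

$$\left[\, |X - k_0 p| < k_0 p/2 \,\right] \quad\text{and}\quad \left[\, |\hat{\rho} - E[Y_1]| < \varepsilon\rho \,\right].$$

In such a case

$$\hat{\rho} \le \varepsilon\rho + E[Y_1] \le \varepsilon\rho + (1+4\varepsilon)\rho = (1+5\varepsilon)\rho$$

and similarly

$$\hat{\rho} \ge E[Y_1] - \varepsilon\rho \ge (1-4\varepsilon)\rho - \varepsilon\rho = (1-5\varepsilon)\rho.$$

Therefore,

$$\Pr\left[\, |\rho - \hat{\rho}| \le 5\varepsilon\rho \,\right] \ge \frac{10}{12}.$$

Replacing $\varepsilon$ by $\varepsilon/5$ in the argument, the claimed probability is obtained.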

It remains to show that we can compute $\hat{\rho}$ in the data stream model. Like before, for each $j \in [k_0]$, we have to maintain the segment $S_j$, information about the choice of the permutation $h_j$, information about $\gamma(S_j)$ and $\gamma(\pi(S_j))$, and the value $\beta(S_j)$. Since $\beta(S_j) \le \gamma(S_j)$ because of Lemma 7(iii), we need $O(\varepsilon^{-1}\log^2 n)$ space per index $j$. In total we need $O(k_0 \varepsilon^{-1}\log^2 n) = O(\varepsilon^{-5}\log^6 n)$ space. □

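Theorem 12. Assume that $\varepsilon \in (0, 1/2)$ and let $\mathbb{I}$ be a set of intervals with endpoints in $\{1, \ldots, n\}$ that arrive in a data stream. There is a data stream algorithm that uses $O(\varepsilon^{-5}\log^6 n)$ space and computes a value $\hat{\alpha}$ such that

$$\Pr\left[\, \tfrac{1}{2}(1-\varepsilon) \cdot \alpha(\mathbb{I}) \le \hat{\alpha} \le \alpha(\mathbb{I}) \,\right] \ge \frac{2}{3}.$$
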
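Proof. We compute the estimate $\hat{N}_{rel}$ of Lemma 10 and the estimate $\hat{\rho}$ of Lemma 11. Define the estimate $\hat{\alpha}_0 = \hat{N}_{rel} \cdot \hat{\rho}$. With probability at least $1 - \frac{2}{12} - \frac{2}{12} = \frac{2}{3}$ we have the events

$$\left[\, |N_{rel} - \hat{N}_{rel}| \le \varepsilon \cdot N_{rel} \,\right] \quad\text{and}\quad \left[\, |\rho - \hat{\rho}| \le \varepsilon\rho \,\right].$$

When such events hold, we can use the definitions of $N_{rel}$ and $\rho$, together with Lemma 8, to obtain

$$\hat{\alpha}_0 \le (1+\varepsilon) N_{rel} \cdot (1+\varepsilon)\rho = (1+\varepsilon)^2 \sum_{S \in \mathbb{S}_{rel}} \beta(S) \le (1+\varepsilon)^2 \alpha(\mathbb{I})$$

and

$$\hat{\alpha}_0 \ge (1-\varepsilon) N_{rel} \cdot (1-\varepsilon)\rho = (1-\varepsilon)^2 \sum_{S \in \mathbb{S}_{rel}} \beta(S) \ge (1-\varepsilon)^2 \left(\tfrac{1}{2} - \varepsilon\right) \alpha(\mathbb{I}).$$

Therefore,

$$\Pr\left[\, (1-\varepsilon)^2 \left(\tfrac{1}{2} - \varepsilon\right) \cdot \alpha(\mathbb{I}) \le \hat{\alpha}_0 \le (1+\varepsilon)^2 \cdot \alpha(\mathbb{I}) \,\right] \ge \frac{2}{3}.$$

Using that $(1-\varepsilon)^2(1/2-\varepsilon)/(1+\varepsilon)^2 \ge 1/2 - 3\varepsilon$ for all $\varepsilon \in (0, 1/2)$, rescaling $\varepsilon$ by $1/6$, and setting $\hat{\alpha} = \hat{\alpha}_0/(1+\varepsilon)^2$, the claimed approximation is obtained. The space bounds are those from Lemmas 10 and 11. □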
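
The inequality used in the last step of the proof can be verified by expanding both sides. The following derivation is a worked check and is not part of the original argument.

```latex
\begin{align*}
(1-\varepsilon)^2\left(\tfrac12-\varepsilon\right)
  &= \tfrac12 - 2\varepsilon + \tfrac52\varepsilon^2 - \varepsilon^3, \\
\left(\tfrac12-3\varepsilon\right)(1+\varepsilon)^2
  &= \tfrac12 - 2\varepsilon - \tfrac{11}{2}\varepsilon^2 - 3\varepsilon^3, \\
(1-\varepsilon)^2\left(\tfrac12-\varepsilon\right)
  - \left(\tfrac12-3\varepsilon\right)(1+\varepsilon)^2
  &= 8\varepsilon^2 + 2\varepsilon^3 \;\ge\; 0 .
\end{align*}
```

Dividing by $(1+\varepsilon)^2 > 0$ gives the stated bound, and replacing $\varepsilon$ by $\varepsilon/6$ turns $1/2 - 3\varepsilon$ into $\frac{1}{2}(1-\varepsilon)$, which is the factor in Theorem 12.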


5 Largest independent set of same-size intervals 

In this section we show how to obtain a $(3/2)$-approximation to the largest independent set using $O(\alpha(\mathbb{I}))$ space in the special case when all the intervals have the same length $\lambda > 0$.

Our approach is based on using the shifting technique of Hochbaum and Maass with a grid of length $3\lambda$ and shifts of length $\lambda$. We observe that we can maintain an optimal solution restricted to a window of length $3\lambda$ because at most two disjoint intervals of length $\lambda$ can fit in.

For any real value $\ell$, let $W_\ell$ denote the window $[\ell, \ell + 3\lambda)$. Note that $W_\ell$ includes the left endpoint but excludes the right endpoint. For $a \in \{0, 1, 2\}$, we define the partition of the real line

$$\mathcal{W}_a = \{\, W_{(a+3j)\lambda} \mid j \in \mathbb{Z} \,\}.$$

For $a \in \{0, 1, 2\}$, let $\mathbb{I}_a$ be the set of input intervals contained in some window of $\mathcal{W}_a$. Thus,

$$\mathbb{I}_a = \{\, I \in \mathbb{I} \mid \exists j \in \mathbb{Z} \text{ s.t. } I \subset W_{(a+3j)\lambda} \,\}.$$

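Lemma 13. If all the intervals of $\mathbb{I}$ have length $\lambda > 0$, then

$$\max\{\alpha(\mathbb{I}_0), \alpha(\mathbb{I}_1), \alpha(\mathbb{I}_2)\} \ge \tfrac{2}{3}\,\alpha(\mathbb{I}).$$

Proof. Each interval of length $\lambda$ is contained in exactly two windows of $\mathcal{W}_0 \cup \mathcal{W}_1 \cup \mathcal{W}_2$. Let $\mathbb{J}^* \subseteq \mathbb{I}$ be a largest independent set of intervals, so that $|\mathbb{J}^*| = \alpha(\mathbb{I})$. We then have

$$3 \cdot \max\{\alpha(\mathbb{I}_0), \alpha(\mathbb{I}_1), \alpha(\mathbb{I}_2)\} \ge \sum_{0 \le a \le 2} \alpha(\mathbb{I}_a) \ge \sum_{0 \le a \le 2} |\mathbb{J}^* \cap \mathbb{I}_a| \ge 2|\mathbb{J}^*| = 2\,\alpha(\mathbb{I})$$

and the result follows. □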
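
Lemma 13 is easy to check experimentally on small inputs. The Python sketch below assumes closed intervals $[x, x+\lambda]$ with integer left endpoints and integer $\lambda$; the helper names `alpha`, `window_start` and `restricted` are hypothetical and exist only for this illustration.

```python
def alpha(left_endpoints, lam):
    """Greedy optimum for closed intervals [x, x + lam] of equal length."""
    count, last_right = 0, None
    for x in sorted(left_endpoints):
        if last_right is None or x > last_right:
            count += 1
            last_right = x + lam
    return count

def window_start(x, lam, a):
    """Left endpoint of the window of W_a that contains the point x."""
    return a * lam + 3 * lam * ((x - a * lam) // (3 * lam))

def restricted(left_endpoints, lam, a):
    """Left endpoints of the intervals of I_a (contained in a window of W_a)."""
    return [x for x in left_endpoints
            if x + lam < window_start(x, lam, a) + 3 * lam]
```

Comparing `3 * max(alpha(restricted(xs, lam, a), lam) for a in range(3))` with `2 * alpha(xs, lam)` on random inputs reproduces the inequality of the lemma.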

For each $a \in \{0, 1, 2\}$ we store an optimal solution $\mathbb{J}_a$ restricted to $\mathbb{I}_a$. We obtain a $(3/2)$-approximation by returning the largest among $\mathbb{J}_0$, $\mathbb{J}_1$, $\mathbb{J}_2$.

For each window $W$ considered through the algorithm, we store Leftmost($W$) and Rightmost($W$). We also store a boolean value active($W$) telling whether some previous interval was contained in $W$. When active($W$) is false, Leftmost($W$) and Rightmost($W$) are undefined.

With a few local operations, we can handle the insertion of new intervals in $\mathbb{I}$. For a window $W \in \mathcal{W}_a$, there are two relevant moments when $\mathbb{J}_a$ may be changed. First, when $W$ gets the first interval, the interval has to be added to $\mathbb{J}_a$ and active($W$) is set to true. Second, when $W$ can first fit two disjoint intervals, then those two intervals have to be added to $\mathbb{J}_a$. See the pseudocode in Figure 8 for a more detailed description.

Lemma 14. For $a = 0, 1, 2$, the policy described in Figure 8 maintains an optimal solution $\mathbb{J}_a$ for the intervals $\mathbb{I}_a$.
Process interval [x, y] of length λ

for a = 0, 1, 2 do
    W ← window of W_a that contains x
    if y ∈ W then (* [x, y] is contained in the window W ∈ W_a *)
        if active(W) = false then
            active(W) ← true
            Rightmost(W) ← [x, y]
            Leftmost(W) ← [x, y]
            add [x, y] to J_a
        else if Rightmost(W) ∩ Leftmost(W) ≠ ∅ then
            [ℓ, r] ← Rightmost(W) ∩ Leftmost(W)
            if ℓ < x then Rightmost(W) ← [x, y]
            if y < r then Leftmost(W) ← [x, y]
            if Rightmost(W) ∩ Leftmost(W) = ∅ then
                remove from J_a the interval contained in W
                add to J_a intervals Rightmost(W) and Leftmost(W)


Figure 8: Policy to process a new interval $[x, x + \lambda]$. $\mathbb{J}_a$ maintains an optimal solution for $\alpha(\mathbb{I}_a)$.
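
For readers who prefer running code, the policy of Figure 8 can be sketched in Python as follows. The sketch assumes closed intervals $[x, x+\lambda]$ with integer coordinates, and it departs from the figure in one respect: instead of updating $\mathbb{J}_a$ incrementally, it rebuilds $\mathbb{J}_a$ on demand from the stored Leftmost and Rightmost intervals. The class and method names are hypothetical.

```python
class SameLengthSelector:
    """Per-window Leftmost/Rightmost bookkeeping in the style of Figure 8."""

    def __init__(self, lam):
        self.lam = lam
        # windows[a] maps the left endpoint of an active window of W_a
        # to the pair [Leftmost(W), Rightmost(W)].
        self.windows = [dict() for _ in range(3)]

    def process(self, x):
        lam = self.lam
        y = x + lam
        for a in range(3):
            start = a * lam + 3 * lam * ((x - a * lam) // (3 * lam))
            if y >= start + 3 * lam:
                continue                  # [x, y] is not contained in W
            state = self.windows[a].get(start)
            if state is None:             # first interval in W: W becomes active
                self.windows[a][start] = [(x, y), (x, y)]
                continue
            leftmost, rightmost = state
            if rightmost[0] <= leftmost[1]:       # the two still intersect
                if rightmost[0] < x:
                    state[1] = (x, y)             # new rightmost left endpoint
                if y < leftmost[1]:
                    state[0] = (x, y)             # new leftmost right endpoint

    def solution(self, a):
        """Rebuild J_a: one or two intervals for each active window of W_a."""
        out = []
        for start in sorted(self.windows[a]):
            leftmost, rightmost = self.windows[a][start]
            if leftmost[1] < rightmost[0]:
                out.extend([leftmost, rightmost])
            else:
                out.append(leftmost)
        return out
```

Returning the largest of `solution(0)`, `solution(1)` and `solution(2)` then corresponds to the $(3/2)$-approximation described in the text.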


Proof. Since a window $W$ of length $3\lambda$ can contain at most 2 disjoint intervals of length $\lambda$, the intervals Rightmost($W$) and Leftmost($W$) suffice to obtain an optimal solution restricted to intervals contained in $W$. By the definition of $\mathbb{I}_a$, an optimal solution for

$$\bigcup_{W \in \mathcal{W}_a} \{\text{Rightmost}(W), \text{Leftmost}(W)\}$$

is an optimal solution for $\mathbb{I}_a$. Since the algorithm maintains such an optimal solution, the claim follows. □

Since each window can have at most two disjoint intervals and each interval is contained in at most two windows of $\mathcal{W}_0 \cup \mathcal{W}_1 \cup \mathcal{W}_2$, we have at most $O(\alpha(\mathbb{I}))$ active windows through the entire stream. Using a dynamic binary search tree for the active windows, we can perform the operations in $O(\log \alpha(\mathbb{I}))$ time. We summarize.

Theorem 15. Let $\mathbb{I}$ be a set of intervals of length $\lambda$ in the real line that arrive in a data stream. There is a data stream algorithm to compute a $(3/2)$-approximation to the largest independent subset of $\mathbb{I}$ that uses $O(\alpha(\mathbb{I}))$ space and handles each interval of the stream in $O(\log \alpha(\mathbb{I}))$ time.

6 Size of largest independent set for same-size intervals 

In this section we show how to obtain a randomized estimate of the value $\alpha(\mathbb{I})$ in the special case when all the intervals have the same length $\lambda > 0$. We assume that the endpoints are in $[n]$.

The idea is an extension of the idea used in Section 5. For $a = 0, 1, 2$, let $\mathcal{W}_a$ and $\mathbb{I}_a$ be as defined in Section 5. For $a = 0, 1, 2$, we will compute a value $\hat{\alpha}_a$ that $(1+\varepsilon)$-approximates $\alpha(\mathbb{I}_a)$ with reasonable probability. We then return $\hat{\alpha} = \max\{\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2\}$, which catches a fraction at least $\frac{2}{3}(1-\varepsilon)$ of $\alpha(\mathbb{I})$, with reasonable probability.

To obtain the $(1+\varepsilon)$-approximation to $\alpha(\mathbb{I}_a)$, we want to estimate how many windows of $\mathcal{W}_a$ contain some input interval and how many contain two disjoint input intervals. For this we use known results for distinct elements as a black box and combine them with sampling over the windows that contain some input interval.
Lemma 16. Let $a$ be 0, 1 or 2 and let $\varepsilon \in (0, 1)$. There is an algorithm in the data stream model that uses $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space and computes a value $\hat{\alpha}_a$ such that

$$\Pr\left[\, |\alpha(\mathbb{I}_a) - \hat{\alpha}_a| \le \varepsilon \cdot \alpha(\mathbb{I}_a) \,\right] \ge \frac{8}{9}.$$

Proof. Let us fix some $a \in \{0, 1, 2\}$. We say that a window $W$ of $\mathcal{W}_a$ is of type $i$ if $W$ contains at least $i$ disjoint input intervals. Since the windows of $\mathcal{W}_a$ have length $3\lambda$, they can be of type 0, 1 or 2. For $i = 0, 1, 2$, let $\gamma_i$ be the number of windows of type $i$ in $\mathcal{W}_a$. Then $\alpha(\mathbb{I}_a) = \gamma_1 + \gamma_2$.

We compute an estimate $\hat{\gamma}_1$ to $\gamma_1$ as follows. We have to estimate $|\{W \in \mathcal{W}_a \mid \exists I \in \mathbb{I} \text{ s.t. } I \subset W\}|$. The stream of intervals $\mathbb{I} = I_1, I_2, \ldots$ defines the sequence of windows $W(\mathbb{I}) = W(I_1), W(I_2), \ldots$, where $W(I_i)$ denotes the window of $\mathcal{W}_a$ that contains $I_i$; if $I_i$ is not contained in any window of $\mathcal{W}_a$, we then skip $I_i$. Then $\gamma_1$ is the number of distinct elements in the sequence $W(\mathbb{I})$. The results of Kane, Nelson and Woodruff [12] imply that using $O(\varepsilon^{-2} + \log n)$ space we can compute a value $\hat{\gamma}_1$ such that

$$\Pr\left[\, (1-\varepsilon)\gamma_1 \le \hat{\gamma}_1 \le (1+\varepsilon)\gamma_1 \,\right] \ge \frac{17}{18}.$$

We next explain how to estimate the ratio $\gamma_2/\gamma_1 \le 1$. Consider a family $\mathcal{H} = \mathcal{H}(n, \varepsilon)$ of permutations $[n] \to [n]$ guaranteed by the earlier lemma on families of permutations, set $k = \lceil 18/\varepsilon^2 \rceil$, and choose permutations $h_1, \ldots, h_k \in \mathcal{H}$ uniformly and independently at random. For each permutation $h_j$, where $j = 1, \ldots, k$, let $W_j$ be the window $[\ell, \ell + 3\lambda)$ of $\mathcal{W}_a$ that contains some input interval and minimizes $h_j(\ell)$. Thus

$$W_j = \arg\min \{\, h_j(\ell) \mid [\ell, \ell + 3\lambda) \in \mathcal{W}_a, \text{ some } I \in \mathbb{I} \text{ is contained in } [\ell, \ell + 3\lambda) \,\}.$$

The idea is that $W_j$ is a nearly-uniform random window of $\mathcal{W}_a$, among those that contain some input interval. Therefore, if we define the random variable

$$M = |\{\, j \in \{1, \ldots, k\} \mid W_j \text{ is of type 2} \,\}|$$

then $\gamma_2/\gamma_1$ is roughly $M/k$. Below we make a precise analysis.

Let us first discuss that $M$ can be computed in space within $O(k \log(1/\varepsilon)) = O(\varepsilon^{-2}\log(1/\varepsilon))$. For each $j$, where $j = 1, \ldots, k$, we keep information about the choice of $h_j$, keep a variable that stores the current window $W_j$ for all the intervals that have been seen so far, and store the intervals Rightmost($W_j$) and Leftmost($W_j$). Those two intervals tell us whether $W_j$ is of type 1 or 2. When handling an interval $I$ of the stream, we may have to update $W_j$; this happens when $h_j(s) < h_j(s_j)$, where $s$ is the left endpoint of the window of $\mathcal{W}_a$ that contains $I$ and $s_j$ is the left endpoint of $W_j$. When $W_j$ is changed, we also have to reset the intervals Rightmost($W_j$) and Leftmost($W_j$) to the new interval $I$.
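
The bookkeeping for $M$ can be sketched in Python as follows, for a fixed $a$. As before, salted BLAKE2 hashes stand in for the permutations $h_j$ (an assumption of this example, without the guarantees of $\mathcal{H}$), intervals are closed with integer coordinates, and the class name `TypeTwoSampler` is hypothetical.

```python
import hashlib

class TypeTwoSampler:
    """For each j, tracks the hash-minimal active window W_j and its type."""

    def __init__(self, lam, a, k, seed=0):
        self.lam, self.a = lam, a
        self.keys = [f"{seed}:{j}".encode() for j in range(k)]
        # state[j] = [hash value, window start, Leftmost, Rightmost] or None
        self.state = [None] * k

    def _h(self, j, start):
        digest = hashlib.blake2b(repr(start).encode(), key=self.keys[j],
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def process(self, x):
        lam, a = self.lam, self.a
        y = x + lam
        start = a * lam + 3 * lam * ((x - a * lam) // (3 * lam))
        if y >= start + 3 * lam:
            return                        # not contained in a window of W_a
        for j in range(len(self.keys)):
            st = self.state[j]
            if st is not None and st[1] == start:
                left, right = st[2], st[3]
                if right[0] <= left[1]:   # W_j is still of type 1
                    if right[0] < x:
                        st[3] = (x, y)
                    if y < left[1]:
                        st[2] = (x, y)
                continue
            hv = self._h(j, start)
            if st is None or hv < st[0]:  # new minimum: reset to interval I
                self.state[j] = [hv, start, (x, y), (x, y)]

    def type_two_count(self):
        """The value M: sampled windows holding two disjoint intervals."""
        return sum(1 for st in self.state
                   if st is not None and st[2][1] < st[3][0])
```

A window can only become $W_j$ when its first interval arrives, so every later interval of that window is seen and the stored type is exact. Multiplying an estimate of $\gamma_1$ by `1 + type_two_count() / k` gives the estimator used in the rest of the proof.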

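To analyze the random variable $M$ more precisely, let us define

$$p = \Pr_{h_j \in \mathcal{H}}[\, W_j \text{ is of type 2} \,] \in \left[ \frac{(1-\varepsilon)\gamma_2}{\gamma_1},\ \frac{(1+\varepsilon)\gamma_2}{\gamma_1} \right].$$

Note that $M$ is the sum of $k$ independent random variables taking values in $\{0, 1\}$ and $E[M] = kp$. It follows from Chebyshev's inequality that

$$\Pr\left[\, \left|\frac{M}{k} - p\right| \ge \varepsilon \,\right] = \Pr\left[\, |M - kp| \ge \varepsilon k \,\right] \le \frac{\mathrm{Var}[M]}{(\varepsilon k)^2} = \frac{kp(1-p)}{\varepsilon^2 k^2} < \frac{1}{\varepsilon^2 k} \le \frac{1}{18}.$$

To finalize, let us define the estimator $\hat{\alpha}_a = \hat{\gamma}_1 \left(1 + \frac{M}{k}\right)$. When the events

$$\left[\, (1-\varepsilon)\gamma_1 \le \hat{\gamma}_1 \le (1+\varepsilon)\gamma_1 \,\right] \quad\text{and}\quad \left[\, \left|\frac{M}{k} - p\right| \le \varepsilon \,\right]$$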
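occur, then we have

$$\hat{\alpha}_a \le (1+\varepsilon)\gamma_1 (1 + \varepsilon + p) \le (1+\varepsilon)\gamma_1 \left(1 + \varepsilon + \frac{(1+\varepsilon)\gamma_2}{\gamma_1}\right) = (1+\varepsilon)^2 (\gamma_1 + \gamma_2) \le (1+3\varepsilon)\,\alpha(\mathbb{I}_a),$$

and also

$$\hat{\alpha}_a \ge (1-\varepsilon)\gamma_1 (1 - \varepsilon + p) \ge (1-\varepsilon)\gamma_1 \left(1 - \varepsilon + \frac{(1-\varepsilon)\gamma_2}{\gamma_1}\right) = (1-\varepsilon)^2 (\gamma_1 + \gamma_2) \ge (1-3\varepsilon)\,\alpha(\mathbb{I}_a).$$

We conclude that

$$\Pr\left[\, (1-3\varepsilon)\,\alpha(\mathbb{I}_a) \le \hat{\alpha}_a \le (1+3\varepsilon)\,\alpha(\mathbb{I}_a) \,\right] \ge 1 - \frac{1}{18} - \frac{1}{18} = \frac{8}{9}.$$

Replacing in the argument $\varepsilon$ by $\varepsilon/3$, the result follows. □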

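Theorem 17. Assume that $\varepsilon \in (0, 1/2)$ and let $\mathbb{I}$ be a set of intervals of length $\lambda$ with endpoints in $\{1, \ldots, n\}$ that arrive in a data stream. There is a data stream algorithm that uses $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space and computes a value $\hat{\alpha}$ such that

$$\Pr\left[\, \tfrac{2}{3}(1-\varepsilon) \cdot \alpha(\mathbb{I}) \le \hat{\alpha} \le \alpha(\mathbb{I}) \,\right] \ge \frac{2}{3}.$$
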


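Proof. For each $a = 0, 1, 2$ we compute the estimate $\hat{\alpha}_a$ to $\alpha(\mathbb{I}_a)$ with the algorithm described in Lemma 16. We then have

$$\Pr\left[\, \bigwedge_{a = 0, 1, 2} \left[\, |\alpha(\mathbb{I}_a) - \hat{\alpha}_a| \le \varepsilon \cdot \alpha(\mathbb{I}_a) \,\right] \right] \ge \frac{6}{9}.$$

When the event occurs, it follows by Lemma 13 that

$$\tfrac{2}{3}(1-\varepsilon) \cdot \alpha(\mathbb{I}) \le \max\{\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2\} \le (1+\varepsilon)\,\alpha(\mathbb{I}).$$

Therefore, using that $(1-\varepsilon)/(1+\varepsilon) \ge 1 - 2\varepsilon$ for all $\varepsilon \in (0, 1/2)$, rescaling $\varepsilon$ by $1/2$, and returning $\hat{\alpha} = \max\{\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2\}/(1+\varepsilon)$, the result is achieved. □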


7 Lower bounds 

Emek, Halldorsson and Rosen showed that any streaming algorithm for the interval selection problem cannot achieve an approximation ratio of $r$, for any constant $r < 2$. They also showed that, for same-size intervals, one cannot obtain an approximation ratio below 3/2. We are going to show that similar inapproximability results hold for estimating $\alpha(\mathbb{I})$.

The lower bound of Emek et al. uses the coordinates of the endpoints to recover information about a permutation. That is, given a solution to the interval selection problem, they can use the endpoints of the intervals to recover information. Thus, such a reduction cannot be adapted to the estimation of $\alpha(\mathbb{I})$, since we are not required to return intervals. Nevertheless, there are certain similarities between their construction and ours, especially for same-length intervals.

Consider the problem Index. The input to Index is a pair $(S, i) \in \{0,1\}^n \times [n]$ and the output, denoted by Index$(S, i)$, is the $i$-th bit of $S$. One can think of $S$ as a subset of $[n]$, and then Index$(S, i)$ is asking whether element $i$ is in the subset or not.


Figure 9: Example showing $\sigma(S, i)$ for $n = 7$, $S = \{1, 3, 4, 6\}$, $L = 9$, and $i = 2$ in the proof of Theorem 18. The intervals are sorted from bottom to top in the order they appear in the data stream. The empty dots represent endpoints that are not included in the interval, while the full dots represent endpoints included in the interval.


The one-way communication complexity of Index is well understood. In this scenario, Alice has $S$ and Bob has $i$. Alice sends a message to Bob and then Bob has to compute Index$(S, i)$. The key question is how long the message should be in the worst case so that Bob can compute Index$(S, i)$ correctly with probability greater than 1/2, say, at least 2/3. (Attaining probability 1/2 is of course trivial.) It is known that, to achieve this, the message of Alice has $\Omega(n)$ bits in the worst case. See [11] for a short proof and [15] for a comprehensive treatment.


Given an input $(S, i)$ for Index, we can build a data stream of same-length intervals $\mathbb{I}$ with the property that $\alpha(\mathbb{I}) \in \{2, 3\}$ and Index$(S, i) = 1$ if and only if $\alpha(\mathbb{I}) = 3$. Moreover, the first part of the stream depends only on $S$ and the second part on $i$. Thus, the state of the memory at the end of the first part can be interpreted as a message that Alice sends to Bob. This implies the following lower bound.


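Theorem 18. Let $c > 0$ be an arbitrary constant. Consider the problem of estimating $\alpha(\mathbb{I})$ for sets of same-length intervals $\mathbb{I}$ with endpoints in $[n]$. In the data streaming model, there is no algorithm that uses $o(n)$ bits of memory and computes an estimate $\hat{\alpha}$ such that

$$\Pr\left[\, \left(\tfrac{2}{3} + c\right) \alpha(\mathbb{I}) \le \hat{\alpha} \le \alpha(\mathbb{I}) \,\right] \ge \frac{2}{3}. \tag{7}$$
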


Proof. For simplicity, we use intervals with endpoints in $[3n]$ and mix closed and open intervals in the proof.

Given an input $(S, i)$ for Index, consider the following stream of intervals. We set $L$ to some large enough value; for example $L = n + 2$ will be enough. Let $\sigma_1(S)$ be a stream that, for each $j \in S$, contains the closed interval $[L + j, 2L + j]$. Let $\sigma_2(i)$ be the length-two stream with open intervals $(i, L + i)$ and $(2L + i, 3L + i)$. Finally, let $\sigma(S, i)$ be the concatenation of $\sigma_1(S)$ and $\sigma_2(i)$. See Figure 9 for an example. Let $\mathbb{I}$ be the intervals in $\sigma(S, i)$. It is straightforward to see that $\alpha(\mathbb{I})$ is 2 or 3. Moreover, $\alpha(\mathbb{I}) = 3$ if and only if Index$(S, i) = 1$.
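
The claim about $\alpha(\mathbb{I})$ can be checked mechanically on the example of Figure 9. The Python sketch below builds $\sigma(S, i)$ and computes $\alpha$ with a greedy sweep. To handle open and closed endpoints with integer arithmetic it doubles all coordinates, which is an implementation device of this example and not part of the proof; the function names are hypothetical.

```python
def encode(lo, hi, closed):
    """Map an integer-endpoint interval to a closed range in doubled coordinates.

    [lo, hi] becomes [2*lo, 2*hi]; (lo, hi) becomes [2*lo + 1, 2*hi - 1].
    Two intervals intersect exactly when their encoded ranges intersect.
    """
    return (2 * lo, 2 * hi) if closed else (2 * lo + 1, 2 * hi - 1)

def alpha(intervals):
    """Size of a largest independent subset (greedy by right endpoint)."""
    count, last_end = 0, None
    ranges = sorted((encode(*iv) for iv in intervals), key=lambda r: r[1])
    for start, end in ranges:
        if last_end is None or start > last_end:
            count += 1
            last_end = end
    return count

def index_stream(S, i, L):
    """The stream sigma(S, i): closed intervals from S, then two open ones."""
    sigma1 = [(L + j, 2 * L + j, True) for j in sorted(S)]
    sigma2 = [(i, L + i, False), (2 * L + i, 3 * L + i, False)]
    return sigma1 + sigma2
```

With $n = 7$, $S = \{1, 3, 4, 6\}$ and $L = 9$, `alpha(index_stream(S, i, 9))` should equal 3 exactly for the indices $i \in S$ and 2 otherwise.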

Assume, for the sake of contradiction, that we have an algorithm in the data streaming model that uses $o(n)$ bits of space and computes a value $\hat{\alpha}$ that satisfies equation (7). Then, Alice and Bob can solve Index$(S, i)$ using $o(n)$ bits, as follows.

Alice simulates the data stream algorithm on $\sigma_1(S)$ and sends to Bob a message encoding the state of the memory at the end of processing $\sigma_1(S)$. The message of Alice has $o(n)$ bits. Then, Bob continues the simulation on the last two items of $\sigma(S, i)$, that is, $\sigma_2(i)$. Bob has correctly computed the output of the algorithm on $\sigma(S, i)$, and therefore obtains $\hat{\alpha}$ so that, with probability at least 2/3, the condition in equation (7) is satisfied. If $\hat{\alpha} > 2$, then Bob returns the bit $\beta = 1$. If $\hat{\alpha} \le 2$, then Bob returns $\beta = 0$. This finishes the description of the protocol.

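Consider the case when Index$(S, i) = 1$. In that case, $\alpha(\mathbb{I}) = 3$. With probability at least 2/3, the value $\hat{\alpha}$ computed satisfies $\hat{\alpha} \ge (\frac{2}{3} + c)\,\alpha(\mathbb{I}) = 2 + 3c > 2$, and therefore

$$\Pr[\, \beta = 1 \mid \text{Index}(S, i) = 1 \,] = \Pr[\, \hat{\alpha} > 2 \mid \text{Index}(S, i) = 1 \,] \ge \Pr\left[\, \hat{\alpha} \ge \left(\tfrac{2}{3} + c\right) \alpha(\mathbb{I}) \mid \text{Index}(S, i) = 1 \,\right] \ge 2/3.$$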
Figure 10: Example showing $\sigma(S, i)$ for $n = 7$, $S = \{1, 3, 4, 6\}$, $L = 9$, $k = 3$, and $i = 2$ in the proof of Theorem 19. The intervals are sorted from bottom to top in the order they appear in the data stream. The empty dots represent endpoints that are not included in the interval, while the full dots represent endpoints included in the interval.


When Index$(S, i) = 0$, then $\alpha(\mathbb{I}) = 2$, and, with probability at least 2/3, the value $\hat{\alpha}$ computed satisfies $\hat{\alpha} \le \alpha(\mathbb{I}) = 2$. Therefore,

$$\Pr[\, \beta = 0 \mid \text{Index}(S, i) = 0 \,] = \Pr[\, \hat{\alpha} \le 2 \mid \text{Index}(S, i) = 0 \,] \ge \Pr[\, \hat{\alpha} \le \alpha(\mathbb{I}) \mid \text{Index}(S, i) = 0 \,] \ge 2/3.$$

We conclude that

$$\Pr[\, \beta = \text{Index}(S, i) \,] \ge 2/3.$$

Since Bob computes $\beta$ after a message from Alice with $o(n)$ bits, this contradicts the lower bound for Index. □


For intervals of different sizes, we can use an alternative construction with the property that $\alpha(\mathbb{I})$ is either $k + 1$ or $2k + 1$. Since $(2k+1)/(k+1)$ tends to 2 as $k$ grows, this means that we cannot guarantee an approximation ratio smaller than 2.


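Theorem 19. Let $c > 0$ be an arbitrary constant. Consider the problem of estimating $\alpha(\mathbb{I})$ for sets of intervals $\mathbb{I}$ with endpoints in $[n]$. In the data streaming model, there is no algorithm that uses $o(n)$ bits of memory and computes an estimate $\hat{\alpha}$ such that

$$\Pr\left[\, \left(\tfrac{1}{2} + c\right) \alpha(\mathbb{I}) \le \hat{\alpha} \le \alpha(\mathbb{I}) \,\right] \ge \frac{2}{3}.$$
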


Proof. Let $k$ be a constant larger than $1/c$. For simplicity, we will use intervals with endpoints in $[2kn]$.

Given an input $(S, i)$ for Index, consider the following stream of intervals. We set $L$ to some large enough value; for example $L = n + 2$ will be enough. Let $\sigma_1(S)$ be a stream that, for each $j \in S$, contains the $k$ open intervals $(j, L + j), (j + L, j + 2L), \ldots, (j + (k-1)L, j + kL)$. Thus $\sigma_1(S)$ has exactly $k|S|$ intervals. Let $\sigma_2(i)$ be the stream with $k + 1$ zero-length closed intervals $[i, i], [i + L, i + L], \ldots, [i + kL, i + kL]$. Finally, let $\sigma(S, i)$ be the concatenation of $\sigma_1(S)$ and $\sigma_2(i)$. See Figure 10 for an example. Let $\mathbb{I}$ be the intervals in $\sigma(S, i)$. It is straightforward to see that $\alpha(\mathbb{I})$ is $k + 1$ or $2k + 1$: the greedy left-to-right optimum contains either all the intervals of $\sigma_2(i)$ or those together with $k$ intervals from $\sigma_1(S)$. This means that $\alpha(\mathbb{I}) = 2k + 1$ if and only if Index$(S, i) = 1$.
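
This construction can also be checked on the example of Figure 10. The Python sketch below reuses the doubled-coordinate greedy from a standalone helper, which is again only an implementation device of this example; the function names are hypothetical.

```python
def encode(lo, hi, closed):
    """[lo, hi] -> [2*lo, 2*hi]; (lo, hi) -> [2*lo + 1, 2*hi - 1] (integer endpoints)."""
    return (2 * lo, 2 * hi) if closed else (2 * lo + 1, 2 * hi - 1)

def alpha(intervals):
    """Size of a largest independent subset (greedy by right endpoint)."""
    count, last_end = 0, None
    ranges = sorted((encode(*iv) for iv in intervals), key=lambda r: r[1])
    for start, end in ranges:
        if last_end is None or start > last_end:
            count += 1
            last_end = end
    return count

def index_stream_k(S, i, L, k):
    """sigma(S, i): k open intervals per element of S, then k + 1 points."""
    sigma1 = [(j + t * L, j + (t + 1) * L, False)
              for j in sorted(S) for t in range(k)]
    sigma2 = [(i + t * L, i + t * L, True) for t in range(k + 1)]
    return sigma1 + sigma2
```

With $n = 7$, $S = \{1, 3, 4, 6\}$, $L = 9$ and $k = 3$, the value `alpha(index_stream_k(S, i, 9, 3))` should be $2k + 1 = 7$ for $i \in S$ and $k + 1 = 4$ otherwise.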

We use a protocol similar to that of the proof of Theorem 18. Alice simulates the data stream algorithm on $\sigma_1(S)$, Bob receives the message from Alice and continues the algorithm on $\sigma_2(i)$, and Bob returns bit $\beta = 1$ if and only if $\hat{\alpha} > k + 1$. With the same argument, and using the fact that $(\frac{1}{2} + c)(2k + 1) > k + 1$ by the choice of $k$, we can prove that using $o(kn) = o(n)$ bits of memory one cannot distinguish, with probability at least 2/3, whether $\alpha(\mathbb{I}) = k + 1$ or $\alpha(\mathbb{I}) = 2k + 1$. The result follows. □